High throughput genome sequencing on DNA arrays

ABSTRACT

The present invention is directed to methods and compositions for acquiring nucleotide sequence information of target sequences using adaptors interspersed in target polynucleotides. The sequence information can be new, e.g. sequencing unknown nucleic acids, re-sequencing, or genotyping. The invention preferably includes methods for inserting a plurality of adaptors at spaced locations within a target polynucleotide or a fragment of a polynucleotide. Such adaptors may serve as platforms for interrogating adjacent sequences using various sequencing chemistries, such as those that identify nucleotides by primer extension, probe ligation, and the like. Encompassed in the invention are methods and compositions for the insertion of known adaptor sequences into target sequences, such that there is an interruption of contiguous target sequence with the adaptors. By sequencing both “upstream” and “downstream” of the adaptors, identification of entire target sequences may be accomplished.

CROSS-REFERENCE TO RELATED APPLICATIONS

This application is a continuation of U.S. application Ser. No. 11/679,124, filed Feb. 26, 2007, which claims priority to provisional application Ser. No. 60/776,415, filed Feb. 24, 2006, and Ser. No. 60/821,960, filed Aug. 10, 2006.

STATEMENT REGARDING FEDERALLY SPONSORED RESEARCH

This application has been partially funded by the Federal Government through Grant No. 1 U01 A1057315-01 of the National Institutes of Health.

BACKGROUND OF THE INVENTION

Large-scale sequence analysis of genomic DNA is central to understanding a wide range of biological phenomena related to states of health and disease both in humans and in many economically important plants and animals, e.g. Collins et al (2003), Nature, 422: 835-847; Service, Science, 311: 1544-1546 (2006); Hirschhorn et al (2005), Nature Reviews Genetics, 6: 95-108; National Cancer Institute, Report of Working Group on Biomedical Technology, “Recommendation for a Human Cancer Genome Project,” (February, 2005); Tringe et al (2005), Nature Reviews Genetics, 6: 805-814. The need for low-cost high-throughput sequencing and re-sequencing has led to the development of several new approaches that employ parallel analysis of many target DNA fragments simultaneously, e.g. Margulies et al, Nature, 437: 376-380 (2005); Shendure et al (2005), Science, 309: 1728-1732; Metzker (2005), Genome Research, 15: 1767-1776; Shendure et al (2004), Nature Reviews Genetics, 5: 335-344; Lapidus et al, U.S. patent publication US 2006/0024711; Drmanac et al, U.S. patent publication US 2005/0191656; Brenner et al, Nature Biotechnology, 18: 630-634 (2000); and the like. Such approaches reflect a variety of solutions for increasing target polynucleotide density in planar arrays and for obtaining increasing amounts of sequence information within each cycle of a particular sequence detection chemistry. Most of these new approaches are restricted to determining a few tens of nucleotides before signals become significantly degraded, thereby placing a limit on overall sequencing efficiency.

Another limitation of traditional high-throughput sequencing techniques is that random positioning of DNA targets over an array surface, which is used in many sequencing methods, reduces the packing efficiency of those targets from what is possible by attaching DNA at predefined sites such as in a grid.

In view of such limitations, it would be advantageous for the field if an additional approach were available to increase the amount of sequencing information that could be obtained from an array of target polynucleotides. Another need in the art is for an efficient and inexpensive way to prepare array supports with billions of binding sites at submicron sizes and distances.

SUMMARY OF THE INVENTION

Accordingly, in one aspect, the invention addresses the problems associated with short sequence read-lengths produced by many approaches to large-scale DNA sequencing, including the problem of obtaining limited sequence information per enzymatic cycle. Also provided are methods and compositions for preparing random arrays of engineered nucleic acid molecules able to support billions of molecules, including molecules at submicron sizes and distances.

In one aspect, the invention provides a method of determining the identification of a first nucleotide at a detection position of a target sequence, wherein the target sequence comprises a plurality of detection positions. In a preferred aspect, the method includes two steps: providing a plurality of concatemers and identifying the first nucleotide. Each concatemer comprises a plurality of monomers, and each monomer comprises: (i) a first target domain of the target sequence comprising a first set of target detection positions; (ii) a first adaptor comprising a Type IIs endonuclease restriction site; (iii) a second target domain of the target sequence comprising a second set of target detection positions; and (iv) a second interspersed adaptor comprising a Type IIs endonuclease restriction site. In a preferred embodiment, the target sequence concatemers are immobilized on a surface. In a further embodiment, the surface is functionalized.

In one embodiment, the invention provides a method of determining the identification of a first nucleotide at a detection position of a target sequence in which the identifying step comprises contacting the concatemers with a set of sequencing probes. In an exemplary embodiment, the sequencing probes each comprise a first domain complementary to one of the adaptors, a unique nucleotide at a first interrogation position, and a label. In a preferred embodiment, the contact between the concatemers and the sequencing probes is accomplished under conditions such that if the unique nucleotide is complementary to the first nucleotide, a sequencing probe hybridizes to the concatemer, thereby identifying the first nucleotide.

In another embodiment, each adaptor comprises an anchor probe hybridization site. The identifying step in an exemplary embodiment comprises: hybridizing anchor probes to the anchor probe hybridization sites, hybridizing sequencing probes to target detection positions adjacent to the adaptors, ligating adjacent hybridized sequencing and anchor probes to form ligated probes, and detecting the ligated probes to identify the first nucleotide.

In another embodiment, each adaptor comprises an anchor probe hybridization site, and the identifying step comprises hybridizing anchor probes to the anchor probe hybridization sites and adding a polymerase and at least one dNTP comprising a label. The polymerase and the at least one dNTP are added under conditions whereby if the dNTP is perfectly complementary to a detection position, the dNTP is added to the anchor probe to form an extended probe, thereby creating an interrogation position of the extended probe. The first nucleotide is identified by determining the nucleotide at the interrogation position of the extended probe.

In a further embodiment of the invention, a nucleotide at a second detection position is identified. In still further embodiments of the invention, nucleotides at a third detection position, at a fourth detection position, at a fifth detection position, and/or at a sixth detection position are identified.

In one embodiment, the invention provides a method of determining the identification of a first nucleotide at a detection position of a target sequence, wherein the target sequence concatemers are immobilized on a surface, and that surface comprises functional moieties including but not limited to amines, silanes, and hydroxyls. In a further embodiment, the surface comprises a plurality of spatially distinct regions comprising said immobilized concatemers. In a still further embodiment, the concatemers are immobilized on the surface using capture probes.

In one aspect, the invention provides a substrate comprising a plurality of immobilized concatemers, each monomer of said concatemer comprising: a first target sequence, a first adaptor comprising a Type IIs endonuclease restriction site, a second target sequence, and a second interspersed adaptor comprising a Type IIs endonuclease restriction site. The Type IIs endonuclease restriction site of the first adaptor may or may not be the same as the Type IIs endonuclease restriction site of the second adaptor. In a further embodiment, each monomer further comprises a third target sequence and a third interspersed adaptor comprising a Type IIs endonuclease restriction site, and in a still further embodiment, each monomer further comprises a fourth target sequence and a fourth interspersed adaptor comprising a Type IIs endonuclease restriction site.

In another aspect, the invention provides methods for inserting multiple adaptors in a target sequence. In a preferred aspect, the method includes the steps of: (i) ligating a first adaptor to one terminus of said target sequence, wherein the adaptor comprises a binding site for a restriction enzyme; (ii) circularizing the product from step (i) to create a first circular polynucleotide; (iii) cleaving the circular polynucleotide with a restriction enzyme, wherein the restriction enzyme is able to bind to the binding site within the first adaptor; (iv) ligating a second adaptor, wherein said second adaptor comprises a binding site for a restriction enzyme; and (v) circularizing the product from step (iv) to create a second circular polynucleotide. In some embodiments, steps (iii) through (v) are repeated to insert a desired number of adaptors in the target sequence. In a preferred embodiment, the circularization step comprises adding a CircLigase™ enzyme.

In another embodiment, the circularization step comprises adding a circularization sequence to a second terminus of the target sequence, hybridizing a bridge template to at least a portion of the adaptor and a portion of the circularization sequence, and ligating the first and second termini together to circularize the target sequence.

In another aspect, the invention provides a method for identifying a nucleotide sequence of a target sequence. In this method, a plurality of interspersed adaptors is provided within the target sequence, and each interspersed adaptor has at least one boundary with the target sequence. At least one nucleotide adjacent to at least one boundary of at least two interspersed adaptors is identified, thereby identifying the nucleotide sequence of the target sequence.

In yet another aspect, the invention provides a library of polynucleotides. In a preferred aspect, the library comprises more than one nucleic acid fragment, and each fragment comprises a plurality of interspersed adaptors in a predetermined order. Each interspersed adaptor has at least one end that comprises a sequence which is not able to cross-hybridize with other sequences of other interspersed adaptors of the plurality. In a further preferred aspect, the predetermined order of interspersed adaptors is identical for every nucleic acid fragment.

In one aspect, the invention provides a method for identifying a nucleotide sequence of a target polynucleotide which comprises the steps of generating an amplicon from each of a plurality of fragments of the target polynucleotide and forming a random array of the amplicons, hybridizing one or more sequencing probes to the random array, determining the identity of at least one nucleotide adjacent to at least one interspersed adaptor by extending the one or more sequencing probes in a sequence specific reaction, and repeating the hybridization and identifying steps until a nucleotide sequence of the target polynucleotide is identified. In a preferred aspect, the sequencing probes are hybridized to the random array under conditions that permit the formation of perfectly matched duplexes between the one or more probes and complementary sequences on interspersed adaptors. In a preferred aspect, each fragment contains a plurality of interspersed adaptors at predetermined sites. In a further aspect, each amplicon comprises multiple copies of a fragment in numbers such that the fragments substantially cover the target polynucleotide. In a still further aspect, the amplicons of the random array are fixed to a surface at a density such that at least a majority of the amplicons are optically resolvable.

In another aspect, the invention provides a method of identifying a nucleotide sequence of a target sequence which comprises the steps of providing a random array of concatemers, hybridizing one or more probes from a first set of probes to the random array, hybridizing one or more probes from a second set of probes to the random array, ligating probes from the first and second sets which are hybridized to a target concatemer at contiguous sites, identifying the sequences of the ligated first and second probes, and repeating the hybridizing, ligating and identifying steps until the sequence of the target sequence is identified. In a preferred aspect, the random array of concatemers comprises concatemers fixed to a planar surface having an array of optically resolvable discrete spaced apart regions, and each concatemer comprises multiple copies of a fragment of the target polynucleotide, such that the number of different concatemers is such that their respective fragments substantially cover the target sequence. In a further aspect, each discrete spaced apart region has an area of less than 1 μm², such that substantially all the discrete spaced apart regions have at most one concatemer attached.

In still another aspect, the invention provides a method of identifying a nucleotide sequence of a target sequence which comprises generating a plurality of concatemers comprising multiple copies of a fragment of the target sequence, forming a random array of the concatemers fixed to a surface at a density such that at least a majority of the concatemers are optically resolvable, and identifying a sequence of at least a portion of each fragment adjacent to at least one interspersed adaptor in at least one concatemer, thereby identifying the nucleotide sequence of the target sequence.

BRIEF DESCRIPTION OF THE DRAWINGS

FIGS. 1A-1G illustrate the invention and applications thereof.

FIGS. 2A-2G illustrate various methods of inserting adaptors in a nucleic acid fragment to produce a target polynucleotide containing interspersed adaptors.

FIGS. 3A-3E illustrate a method of high-throughput sequencing that can be implemented on target polynucleotides containing interspersed adaptors.

FIG. 4 provides a comparison of structured and standard random DNA arrays made by attaching RCR products.

FIG. 5 illustrates reference patterns on an ordered array.

FIG. 6 shows random arrays imaged on a rSBH instrument.

FIG. 7 shows three array images overlaid with slight shifts for easier viewing.

FIG. 8 shows five array images overlaid with slight shifts.

FIG. 9 shows five array images overlaid with slight shifts.

FIG. 10 shows an image of an array in which lines of capture probe across the surface of the coverslip were used to specifically bind to DNBs.

DETAILED DESCRIPTION OF THE INVENTION

The practice of the present invention may employ, unless otherwise indicated, conventional techniques and descriptions of organic chemistry, polymer technology, molecular biology (including recombinant techniques), cell biology, biochemistry, and immunology, which are within the skill of the art. Such conventional techniques include polymer array synthesis, hybridization, ligation, and detection of hybridization using a label. Specific illustrations of suitable techniques can be had by reference to the example herein below. However, other equivalent conventional procedures can, of course, also be used. Such conventional techniques and descriptions can be found in standard laboratory manuals such as Genome Analysis: A Laboratory Manual Series (Vols. I-IV), Using Antibodies: A Laboratory Manual, Cells: A Laboratory Manual, PCR Primer: A Laboratory Manual, and Molecular Cloning: A Laboratory Manual (all from Cold Spring Harbor Laboratory Press), Stryer, L. (1995) Biochemistry (4th Ed.) Freeman, New York, Gait, “Oligonucleotide Synthesis: A Practical Approach” 1984, IRL Press, London, Nelson and Cox (2000), Lehninger, Principles of Biochemistry 3rd Ed., W. H. Freeman Pub., New York, N.Y. and Berg et al. (2002) Biochemistry, 5th Ed., W. H. Freeman Pub., New York, N.Y., all of which are herein incorporated in their entirety by reference for all purposes.

Overview

The present invention is directed to methods and compositions for acquiring nucleotide sequence information of target sequences (also referred to herein as “target polynucleotides”) using adaptors interspersed in target polynucleotides. The sequence information can be new, e.g. sequencing unknown nucleic acids, resequencing, or genotyping. The invention preferably includes methods for inserting a plurality of adaptors at spaced locations within a target polynucleotide or a fragment of a polynucleotide. Such adaptors are referred to herein as “interspersed adaptors”, and may serve as platforms for interrogating adjacent sequences using various sequencing chemistries, such as those that identify nucleotides by primer extension, probe ligation, and the like. That is, one unique component of some embodiments of the invention is the insertion of known adaptor sequences into target sequences, such that there is an interruption of contiguous target sequence with the adaptors. By sequencing both “upstream” and “downstream” of the adaptor, sequence information of entire target sequences may be obtained.

Accordingly, without limitation, the inventions can generally be described as follows (it should be noted that genomic DNA is used as an example herein, but is not meant to be limiting). Genomic DNA from any organism is isolated and fragmented into target sequences using standard techniques. A first adaptor is ligated to one terminus of the target sequence. The adaptor preferably comprises a recognition site for a Type IIs restriction endonuclease, an enzyme which cuts outside of its recognition sequence. If the enzyme results in a “sticky” end, the overhang portion can either be filled in or removed.

In one embodiment, an enzyme is used to ligate the two ends of the linear strand comprising the adaptor and the target sequence to form a circularized nucleic acid. This may be done using a single step. Alternatively, a second adaptor can be added to the other terminus of the target sequence (for example, a polyA tail), and then a bridging sequence can be hybridized to the two adaptors, followed by ligation. In either embodiment, a circular sequence is formed.

The circular sequence is then cut with the Type IIs endonuclease, resulting in a linear strand, and the process of adaptor ligation and circularization is repeated. This results in a circular sequence with adaptors interspersed at well defined locations within previously contiguous target sequences.
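By way of illustration only, the cut-and-ligate cycle described above can be sketched in Python as a string operation on a circular molecule. This is a simplified model under stated assumptions and is not the method of the invention: the adaptor sequences A1, A2 and A3, the target sequence, the function name insert_adaptor, and the fixed cutting distance reach are all hypothetical, and overhangs, strand orientation and ligation efficiency are ignored.

```python
def insert_adaptor(circle: str, anchor: str, new_adaptor: str, reach: int) -> str:
    """Model one insertion cycle on a circular molecule.

    The circle is written as a linear string whose last base joins its first.
    A hypothetical Type IIs enzyme binds `anchor` and cuts `reach` bases
    downstream of it; `new_adaptor` is ligated in at the cut and the molecule
    is closed again. The result is written starting at `anchor`.
    """
    if not 0 <= reach <= len(circle) - len(anchor):
        raise ValueError("cut position falls outside the circle")
    # Search the doubled string so an anchor spanning the join is still found.
    start = (circle + circle).find(anchor)
    if start < 0 or start >= len(circle):
        raise ValueError("anchor adaptor not present in the circle")
    rotated = circle[start:] + circle[:start]  # open the circle at the anchor
    cut = len(anchor) + reach                  # cleavage point inside the target
    return rotated[:cut] + new_adaptor + rotated[cut:]


# Hypothetical sequences: target in lower case, adaptors in upper case.
TARGET = "acgttgcaagctcattggac"
A1, A2, A3 = "AAGGTT", "CCTTAA", "GGAACC"

circle1 = insert_adaptor(A1 + TARGET, A1, A2, reach=8)  # A2 lands 8 bases past A1
circle2 = insert_adaptor(circle1, A2, A3, reach=8)      # A3 lands 8 bases past A2
print(circle1)
print(circle2)
```

In this toy model each cycle places the new adaptor a fixed number of bases from the previous one, so the order and spacing of the adaptors are known, while the target bases keep their original circular order.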

The circularized sequences are then amplified using a rolling circle replication (RCR) reaction, to form concatemers of the original target sequence (e.g. multimers of monomers). These long concatemers form “DNA nanoballs” (“DNBs”), which can then optionally be immobilized on a surface in a variety of ways, as outlined below.

Once on the surface, using the known adaptor sequences, sequencing of the intervening target sequences is done. As is known in the art, there are a number of techniques that can be used to detect or determine the identity of a base at a particular location in a target nucleic acid, including, but not limited to, the use of temperature, competitive hybridization of perfect and imperfect probes to the target sequence, sequencing by synthesis, for example using single base extension techniques (sometimes referred to as “minisequencing”), the oligonucleotide ligase amplification (OLA) reaction, rolling circle replication (RCR), allelic PCR, competitive hybridization and Invader™ technologies. Preferred embodiments include sequencing by hybridization with ligation, and sequencing by hybridization.

The sequence information can then be used to reconstruct sequences of larger target sequences, such as sequencing of the entire genomic DNA.

Sequencing large numbers of nucleic acids, as is necessary in applications such as genome analysis, epidemiological studies, and diagnostic tests, generally involves adapting sequencing technologies to high-throughput formats. However, there are drawbacks to traditional high-throughput sequencing techniques, particularly the problem of short sequence read lengths; that is, many high-throughput sequencing approaches are limited in the length and type of target polynucleotides that may be successfully sequenced. This limitation is primarily due to the number of contiguous bases that can be determined on a single fragment in a single operation. By providing a plurality of sites in each target polynucleotide or fragment from which to conduct particular sequencing chemistries, the present invention provides a multiplicity of adjacent sequence reads. In one aspect, these adjacent reads are contiguous, thereby effectively amplifying the expected read lengths of a large class of sequencing chemistries.

The present invention thus allows the determination of a longer contiguous or almost contiguous target sequence by determining the sequences on either side of adaptors.
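As a hedged illustration of reading on either side of adaptors, the following Python sketch collects the bases immediately upstream and downstream of each adaptor in a monomer. The function name flanking_reads, the sequences, and the read length are hypothetical; the sketch assumes the monomer is written so that no adaptor spans its ends, and it models only the bookkeeping, not any sequencing chemistry.

```python
def flanking_reads(monomer: str, adaptors: dict, read_len: int) -> dict:
    """Return {adaptor name: (upstream read, downstream read)}.

    Reads may wrap across monomer junctions, as they would in a concatemer
    made of tandem copies of the monomer.
    """
    n = len(monomer)
    if not 0 < read_len <= n:
        raise ValueError("read_len must be between 1 and the monomer length")
    tandem = monomer * 3  # three tandem copies; reads are taken around the middle one
    reads = {}
    for name, seq in adaptors.items():
        start = monomer.find(seq)
        if start < 0:
            raise ValueError(f"adaptor {name} not found in monomer")
        left = n + start        # adaptor start within the middle copy
        right = left + len(seq)  # first base after the adaptor
        reads[name] = (tandem[left - read_len:left], tandem[right:right + read_len])
    return reads


# Hypothetical monomer: adaptors in upper case, target segments in lower case.
A1, A2 = "AAGGTT", "CCTTAA"
MONOMER = A1 + "acgttgca" + A2 + "agctcattggac"

reads = flanking_reads(MONOMER, {"A1": A1, "A2": A2}, read_len=4)
print(reads)
```

In this example the four bases read downstream of A1 and the four bases read upstream of A2 together span the eight-base segment between the two adaptors, which is the sense in which adjacent short reads can add up to a longer contiguous one.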

Compositions/Structures of Target Polynucleotides

Accordingly, the present invention provides compositions and methods utilizing target sequences from samples. As will be appreciated by those in the art, the sample solution may comprise any number of things, including, but not limited to, bodily fluids (including, but not limited to, blood, urine, serum, lymph, saliva, anal and vaginal secretions, perspiration and semen) and cells of virtually any organism, with mammalian samples being preferred and human samples being particularly preferred; environmental samples (including, but not limited to, air, agricultural, water and soil samples); biological warfare agent samples; research samples (i.e. in the case of nucleic acids, the sample may be the products of an amplification reaction, including both target and signal amplification, such as PCR amplification reactions); purified samples, such as purified genomic DNA, RNA preparations, raw samples (bacteria, virus, genomic DNA, etc.); as will be appreciated by those in the art, virtually any experimental manipulation may have been done on the samples.

In general, cells from the target organism (animal, avian, mammalian, etc.) are used. When genomic DNA is used, the amount of genomic DNA required for constructing arrays of the invention can vary widely. In one aspect, for mammalian-sized genomes, fragments are generated from at least about 10 genome-equivalents of DNA; and in another aspect, fragments are generated from at least about 30 genome-equivalents of DNA; and in another aspect, fragments are generated from at least about 60 genome-equivalents of DNA.

The target sequences or target polynucleotides are nucleic acids. By “nucleic acid” or “oligonucleotide” or grammatical equivalents herein is meant at least two nucleotides covalently linked together. A nucleic acid of the present invention will generally contain phosphodiester bonds, although in some cases, as outlined below (for example in the construction of primers and probes such as label probes), nucleic acid analogs are included that may have alternate backbones, comprising, for example, phosphoramide (Beaucage et al., Tetrahedron 49(10): 1925 (1993) and references therein; Letsinger, J. Org. Chem. 35:3800 (1970); Sprinzl et al., Eur. J. Biochem. 81:579 (1977); Letsinger et al., Nucl. Acids Res. 14:3487 (1986); Sawai et al, Chem. Lett. 805 (1984), Letsinger et al., J. Am. Chem. Soc. 110:4470 (1988); and Pauwels et al., Chemica Scripta 26:141 (1986)), phosphorothioate (Mag et al., Nucleic Acids Res. 19:1437 (1991); and U.S. Pat. No. 5,644,048), phosphorodithioate (Briu et al., J. Am. Chem. Soc. 111:2321 (1989)), O-methylphosphoroamidite linkages (see Eckstein, Oligonucleotides and Analogues: A Practical Approach, Oxford University Press), and peptide nucleic acid backbones and linkages (see Egholm, J. Am. Chem. Soc. 114:1895 (1992); Meier et al., Chem. Int. Ed. Engl. 31:1008 (1992); Nielsen, Nature, 365:566 (1993); Carlsson et al., Nature 380:207 (1996), all of which are incorporated by reference). Other analog nucleic acids include those with bicyclic structures including locked nucleic acids, Koshkin et al., J. Am. Chem. Soc. 120:13252-3 (1998); positive backbones (Dempcy et al., Proc. Natl. Acad. Sci. USA 92:6097 (1995)); non-ionic backbones (U.S. Pat. Nos. 5,386,023, 5,637,684, 5,602,240, 5,216,141 and 4,469,863; Kiedrowski et al., Angew. Chem. Intl. Ed. English 30:423 (1991); Letsinger et al., J. Am. Chem. Soc. 110:4470 (1988); Letsinger et al., Nucleoside & Nucleotide 13:1597 (1994); Chapters 2 and 3, ACS Symposium Series 580, “Carbohydrate Modifications in Antisense Research”, Ed. Y. S. Sanghvi and P. Dan Cook; Mesmaeker et al., Bioorganic & Medicinal Chem. Lett. 4:395 (1994); Jeffs et al., J. Biomolecular NMR 34:17 (1994); Tetrahedron Lett. 37:743 (1996)) and non-ribose backbones, including those described in U.S. Pat. Nos. 5,235,033 and 5,034,506, and Chapters 6 and 7, ACS Symposium Series 580, “Carbohydrate Modifications in Antisense Research”, Ed. Y. S. Sanghvi and P. Dan Cook. Nucleic acids containing one or more carbocyclic sugars are also included within the definition of nucleic acids (see Jenkins et al., Chem. Soc. Rev. (1995) pp. 169-176). Several nucleic acid analogs are described in Rawls, C & E News Jun. 2, 1997 page 35. All of these references are hereby expressly incorporated by reference. These modifications of the ribose-phosphate backbone may be done to increase the stability and half-life of such molecules in physiological environments. For example, PNA:DNA hybrids can exhibit higher stability and thus may be used in some embodiments.

The nucleic acids may be single stranded or double stranded, as specified, or contain portions of both double stranded and single stranded sequence. The nucleic acids may be DNA, both genomic and cDNA, RNA or a hybrid, where the nucleic acid contains any combination of deoxyribo- and ribo-nucleotides, and any combination of bases, including uracil, adenine, thymine, cytosine, guanine, inosine, xanthine, hypoxanthine, isocytosine, isoguanine, etc.

The term “target sequence” or “target nucleic acid” or grammatical equivalents herein means a nucleic acid sequence on a single strand of nucleic acid. The target sequence may be a portion of a gene, a regulatory sequence, genomic DNA, cDNA, RNA including mRNA and rRNA, or others. As is outlined herein, the target sequence may be a target sequence from a sample, or a secondary target such as a product of an amplification reaction, etc. It may be any length.

As is outlined more fully below, probes are made to hybridize to target sequences to determine the presence or absence of the target sequence in a sample. Generally speaking, this term will be understood by those skilled in the art. The target sequence may also be comprised of different target domains; for example, a first target domain of the sample target sequence may hybridize to a capture probe and a second target domain may hybridize to a label probe, etc. The target domains may be adjacent or separated as indicated. Unless specified, the terms “first” and “second” are not meant to confer an orientation of the sequences with respect to the 5′-3′ orientation of the target sequence. For example, assuming a 5′-3′ orientation of the complementary target sequence, the first target domain may be located either 5′ to the second domain, or 3′ to the second domain.

In one embodiment, genomic DNA, particularly human genomic DNA, is used. Genomic DNA is obtained using conventional techniques, for example, as disclosed in Sambrook et al., supra, 1999; Current Protocols in Molecular Biology, Ausubel et al., eds. (John Wiley and Sons, Inc., NY, 1999), or the like. Important factors for isolating genomic DNA include the following: 1) the DNA is free of DNA processing enzymes and contaminating salts; 2) the entire genome is equally represented; and 3) the DNA fragments are between about 5,000 and 100,000 bp in length.

In many cases, no digestion of the extracted DNA is required because shear forces created during lysis and extraction will generate fragments in the desired range. In another embodiment, shorter fragments (1-5 kb) can be generated by enzymatic fragmentation using restriction endonucleases. In one embodiment, 10-100 genome-equivalents of DNA ensure that the population of fragments covers the entire genome. In some cases, it is advantageous to provide carrier DNA, e.g. unrelated circular synthetic double-stranded DNA, to be mixed and used with the sample DNA whenever only small amounts of sample DNA are available and there is danger of losses through nonspecific binding, e.g. to container walls and the like. In one embodiment, the DNA is denatured after fragmentation to produce single stranded fragments.
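One way to see why tens of genome-equivalents suffice to cover the genome is a simple Poisson estimate, sketched below in Python. The model is an assumption introduced here for illustration and is not stated in this disclosure: it supposes that fragments sample the genome uniformly and independently, so that with c genome-equivalents each base is present c times on average and is absent from the fragment population with probability of about e^(-c). Losses during handling are not modeled, and the function name uncovered_fraction is hypothetical.

```python
import math


def uncovered_fraction(genome_equivalents: float) -> float:
    """Expected fraction of bases present in no fragment (Poisson assumption)."""
    if genome_equivalents < 0:
        raise ValueError("genome_equivalents must be non-negative")
    return math.exp(-genome_equivalents)


for c in (1, 10, 30, 60, 100):
    print(f"{c:>3} genome-equivalents: uncovered fraction ~ {uncovered_fraction(c):.3g}")
```

Under this idealized model the uncovered fraction falls off exponentially, so it is already below one part in ten thousand at 10 genome-equivalents; real preparations would be expected to do somewhat worse because of sampling bias and material losses.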

Target polynucleotides may be generated from a source nucleic acid, such as genomic DNA, by fragmentation to produce fragments of a specific size; in one embodiment, the fragments are 50 to 600 nucleotides in length. In another embodiment, the fragments are 300 to 600 or 200 to 2000 nucleotides in length. In yet another embodiment, the fragments are 10-100, 50-100, 50-300, 100-200, 200-300, 50-400, 100-400, 200-400, 400-500, 400-600, 500-600, 50-1000, 100-1000, 200-1000, 300-1000, 400-1000, 500-1000, 600-1000, 700-1000, 700-900, 700-800, 800-1000, 900-1000, 1500-2000, 1750-2000, or 50-2000 nucleotides in length. These fragments may in turn be circularized for use in an RCR reaction or in other biochemical processes, such as the insertion of additional adaptors.

Polynucleotides of the invention have interspersed adaptors that permit acquisition of sequence information from multiple sites, either consecutively or simultaneously. Interspersed adaptors are oligonucleotides that are inserted at spaced locations within the interior region of a target polynucleotide. In one aspect, “interior” in reference to a target polynucleotide means a site internal to a target polynucleotide prior to processing, such as circularization and cleavage, that may introduce sequence inversions, or like transformations, which disrupt the ordering of nucleotides within a target polynucleotide.

In one aspect, as is more fully outlined below, interspersed adaptors are inserted at intervals within a contiguous region of a target polynucleotide. In some cases, such intervals have predetermined lengths, which may or may not be equal. In other cases, the spacing between interspersed adaptors may be known only to an accuracy of from one to a few nucleotides (e.g. from 1 to 15), or from one to a few tens of nucleotides (e.g. from 10 to 40), or from one to a few hundreds of nucleotides (e.g. from 100 to 200). Preferably, the ordering and number of interspersed adaptors within each target polynucleotide is known. In some aspects of the invention, interspersed adaptors are used together with adaptors that are attached to the ends of target polynucleotides.

In one aspect, the invention provides target polynucleotides in the form of concatemers which contain multiple copies (e.g. “monomers”) of a target polynucleotide or a fragment of a target polynucleotide. DNA concatemers under conventional conditions (a conventional DNA buffer, e.g. TE, SSC, SSPE, or the like, at room temperature) form random coils that roughly fill a spherical volume in solution having a diameter of from about 100 to 300 nm, which depends on the size of the DNA and buffer conditions, in a manner well known in the art, e.g. Edvinsson, “On the size and shape of polymers and polymer complexes,” Dissertation 696 (University of Uppsala, 2002).

One measure of the size of a random coil polymer, such as single stranded DNA, is a root mean square of the end-to-end distance, which is roughly a measure of the diameter of the randomly coiled structure. Such diameter, referred to herein as a “random coil diameter,” can be measured by light scatter, using instruments, such as a Zetasizer Nano System (Malvern Instruments, UK), or like instrument. Additional size measures of macromolecular structures of the invention include molecular weight, e.g. in Daltons, and total polymer length, which in the case of a branched polymer is the sum of the lengths of all its branches.
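For orientation, the root mean square end-to-end distance can be estimated with the textbook ideal (freely jointed) chain model, in which a chain of N independent segments of length b has an RMS end-to-end distance of b times the square root of N. The Python sketch below applies that formula. The choice of model is an assumption made here, and the segment count and segment length in the example are placeholders rather than measured properties of any concatemer; real single stranded DNA deviates from the ideal chain depending on sequence, salt and temperature.

```python
import math


def rms_end_to_end_nm(n_segments: int, segment_length_nm: float) -> float:
    """RMS end-to-end distance of an ideal freely jointed chain: b * sqrt(N)."""
    if n_segments < 0 or segment_length_nm < 0:
        raise ValueError("arguments must be non-negative")
    return segment_length_nm * math.sqrt(n_segments)


# Placeholder inputs: 10,000 segments of 1.5 nm each (illustrative only).
print(rms_end_to_end_nm(10_000, 1.5))
```

Because the estimate grows only with the square root of the number of segments, a hundredfold longer chain is only about ten times wider in this model, which is consistent with long concatemers remaining compact.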

Upon attachment to a surface, depending on the attachment chemistry, density of linkages, the nature of the surface, and the like, single stranded polynucleotides fill a flattened spheroidal volume that on average is bounded by a region which is approximately equivalent to the diameter of a concatemer in random coil configuration. Preserving the compact form of the macromolecular structure on the surface allows a more intense signal to be produced by probes, e.g. fluorescently labeled oligonucleotides, specifically directed to components of a concatemer.

In some embodiments, classes of polynucleotides may be created by providing adaptors having different anchor probe binding sites. This type of “clustering” allows for increased efficiency in obtaining sequence information of the polynucleotides.

Methods of Fragmentation

Effective mapping strategies are needed for sequencing applications such as sequencing complex diploid genomes, de novo sequencing, and sequencing mixtures of genomes. In one embodiment, hierarchical fragmentation procedures are provided to identify haplotype information and assemble parental chromosomes for diploid genomes. Such procedures may also be applied to predicting protein alleles and to mapping short reads to the correct positions within a genome. Another use for such methods is the correct assignment of a mutation in a gene family which occurs within ˜100 bases of DNA sequence shared between multiple genes.

FIG. (1C-D) illustrates one aspect of the invention, in which source nucleic acid (1600) (which may be, or contain, a single or several target polynucleotides) is treated (1601) to form single stranded fragments (1602), preferably in the range of from 50 to 600 nucleotides, and more preferably in the range of from 300 to 600 nucleotides, which are then ligated to adaptor oligonucleotides (1604) to form a population of adaptor-fragment conjugates (1606). Adaptor (1604) is usually an initial adaptor, which need not be “interspersed” in the sense that it separates two sequences which were contiguous in the original sequence. Source nucleic acid (1600) may be genomic DNA extracted from a sample using conventional techniques, or a cDNA or genomic library produced by conventional techniques, or synthetic DNA, or the like. Treatment (1601) usually entails fragmentation by a conventional technique, such as chemical fragmentation, enzymatic fragmentation, or mechanical fragmentation, followed by denaturation to produce single stranded DNA fragments.

In generating fragments in either stage, fragments may be derived from either an entire genome or from a selected subset of a genome. Many techniques are available for isolating or enriching fragments from a subset of a genome, as exemplified by the following references, which are incorporated in their entirety by reference: Kandpal et al (1990), Nucleic Acids Research, 18: 1789-1795; Callow et al, U.S. patent publication 2005/0019776; Zabeau et al, U.S. Pat. No. 6,045,994; Deugau et al, U.S. Pat. No. 5,508,169; Sibson, U.S. Pat. No. 5,728,524; Guilfoyle et al, U.S. Pat. No. 5,994,068; Jones et al, U.S. patent publication 2005/0142577; Gullberg et al, U.S. patent publication 2005/0037356; Matsuzaki et al, U.S. patent publication 2004/0067493; and the like.

In one embodiment, shear forces during lysis and extraction of genomic DNA generate fragments in a desired range. Also encompassed by the invention are methods of fragmentation utilizing restriction endonucleases.

In a preferred embodiment, particularly for mammalian-sized genomes, fragmentation is carried out in at least two stages, a first stage to generate a population of fragments in a size range of from about 100 kilobases (Kb) to about 250 kilobases, and a second stage, applied separately to each 100-250 Kb fragment, to generate fragments in the size range of from about 50 to 600 nucleotides, and more preferably in the range of from about 300 to 600 nucleotides, for generating concatemers for a random array. In some aspects of the invention, the first stage of fragmentation may also be employed to select a predetermined subset of such fragments, e.g. fragments containing genes that encode proteins of a signal transduction pathway, and the like.

In one embodiment, the sample genomic DNA is fragmented using techniques outlined in U.S. Ser. No. 11/451,692, hereby incorporated by reference in its entirety. In this aspect, genomic DNA is isolated as 30-300 kb sized fragments. Through proper dilution, a small subset of these fragments is, at random, placed in discrete wells of multi-well plates or similar accessories. For example, a plate with 96, 384 or 1536 wells can be used for these fragment subsets. An optimal way to create these DNA aliquots is to isolate the DNA with a method that naturally fragments to high molecular weight forms, dilute to 10-30 genome equivalents after quantitation, and then split the entire preparation into 384 wells. This provides representation of all genomic sequences, and performing DNA isolation on 10-30 cells with 100% recovery efficiency assures that all chromosomal regions are represented with the same coverage. By providing aliquots in this method, the probability of placing two overlapping fragments from the same region of a chromosome into the same plate well is minimized. For diploid genomes represented with 10× coverage, there are 20 overlapping fragments on average to separate into distinct wells. If this sample is distributed over a 384 well plate, then each well contains, on average, 1,562 fragments. By forming 384 fractions in a standard 384-well plate, there is only about a 1/400 chance that two overlapping fragments may end up in the same well. Even if some matching fragments are placed in the same well, the other overlapping fragments from each chromosomal region provide the unique mapping information.
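The per-well figures quoted above can be checked with a short calculation. The sketch below is illustrative only: it assumes a haploid genome of about 3 Gb, fragments of about 100 kb, and 10 diploid cells as input (values chosen because they reproduce the 1,562 fragments per well stated in the text); the function and variable names are hypothetical.

```python
def fragments_per_well(haploid_genome_bp, ploidy, cells, fragment_bp, wells):
    """Return (total fragments, mean fragments per well) for a diluted sample."""
    total = haploid_genome_bp * ploidy * cells // fragment_bp
    return total, total / wells


# Assumed inputs (not stated explicitly in the text): ~3 Gb haploid genome,
# ~100 kb fragments, 10 diploid cells, one 384-well plate.
total_fragments, mean_per_well = fragments_per_well(
    haploid_genome_bp=3_000_000_000,
    ploidy=2,
    cells=10,
    fragment_bp=100_000,
    wells=384,
)

# Chance that one particular overlapping fragment lands in the same well
# as a given fragment, assuming uniform random placement across wells.
same_well_probability = 1 / 384

print(total_fragments, mean_per_well, same_well_probability)
```

Under these assumptions the sample holds 600,000 fragments, or 1,562.5 per well, and the same-well probability of 1/384 is consistent with the "about 1/400" figure in the text.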

In one embodiment, the prepared groups of long fragments are further cut to the final fragment size of about 300 to 600 bases. To obtain sufficient (e.g., 10×) coverage of each fragment in a group, the DNA in each well may be amplified before final cutting using well-developed whole genome amplification methods.

All short fragments from one well may then be arrayed and sequenced on one separate unit array or in one section of a larger continuous matrix. A composite array of 384 unit arrays is ideal for parallel analysis of these groups of fragments. In the assembly of long sequences representing parental chromosomes, the algorithm may use the critical information that short fragments detected in one unit array belong to a limited number of longer continuous segments each representing a discrete portion of one chromosome. In almost all cases the homologous chromosomal segments may be analyzed on different unit arrays. Long (˜100 Kb) continuous initial segments form a tiling pattern and provide sufficient mapping information to assemble each parental chromosome separately as depicted below by relying on about 100 polymorphic sites per 100 kb of DNA. In the following example dots represent 100-1000 consecutive bases that are identical in corresponding segments.

Well 3     .......T........C........C...G........A........
Well 20       ....C........T........T...A........G........C...
Well 157                     .......T...A........G........C........A...C...
Well 258                 ...C........C...G........A........T........G...T....

Wells 3 and 258 assemble chromosome 1 of Parent 1:
    ...T........C........C...G........A........T........G...T
Wells 20 and 157 assemble chromosome 1 of Parent 2:
    ...C........T........T...A..........G......C........A...C...
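The assembly logic in this example can be sketched in a few lines. The following is a minimal illustration, not the assembly algorithm of the invention: each well's long fragment is reduced to a hypothetical mapping from polymorphic site index to observed allele (site indices 0-7 are invented to mirror the diagram), and fragments are merged greedily when they overlap at one or more sites and agree at every shared site.

```python
def compatible(a, b):
    """True if two allele maps share at least one site and agree at all shared sites."""
    shared = set(a) & set(b)
    return bool(shared) and all(a[site] == b[site] for site in shared)


def assemble_haplotypes(well_fragments):
    """Greedy single-pass merge of well fragments into haplotypes.

    well_fragments: dict of well number -> {site index: allele}, processed in
    insertion order. A production assembler would also need to handle
    fragments that arrive before any overlapping partner, and read errors.
    """
    haplotypes = []
    for well, fragment in well_fragments.items():
        for hap in haplotypes:
            if compatible(hap["alleles"], fragment):
                hap["alleles"].update(fragment)
                hap["wells"].append(well)
                break
        else:
            haplotypes.append({"alleles": dict(fragment), "wells": [well]})
    return haplotypes


# Hypothetical site indexing for the four wells in the diagram.
wells = {
    3: {0: "T", 1: "C", 2: "C", 3: "G", 4: "A"},
    20: {0: "C", 1: "T", 2: "T", 3: "A", 4: "G", 5: "C"},
    157: {2: "T", 3: "A", 4: "G", 5: "C", 6: "A", 7: "C"},
    258: {1: "C", 2: "C", 3: "G", 4: "A", 5: "T", 6: "G", 7: "T"},
}

haplotypes = assemble_haplotypes(wells)
haplotype_strings = [
    "".join(hap["alleles"][site] for site in sorted(hap["alleles"]))
    for hap in haplotypes
]
print([hap["wells"] for hap in haplotypes], haplotype_strings)
```

With this toy input, wells 3 and 258 merge into one haplotype and wells 20 and 157 into the other, matching the grouping shown in the diagram.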

In one embodiment, amplification of the single targets obtained in the chromosomal separation procedure is accomplished using methods known in the art for whole genome amplification. In a preferred embodiment, methods that produce 10-100 fold amplification are used. In one embodiment, these procedures do not discriminate in terms of the sequences that are to be amplified but instead amplify all sequences within a sample. Such a procedure does not require intact amplification of entire 100 kb fragments, and shorter fragments, such as fragments from 1-10 kb, can be used.

Composition/Structure of Interspersed Adaptors

In one aspect, interspersed adaptors are inserted at intervals within a contiguous region of a target polynucleotide. Interspersed adaptors may vary widely in length, which depends in part on the number and type of functional elements desired. Such functional elements include, but are not limited to, anchor sequences, sequences complementary to capture probe sequences (e.g. for attachment to surfaces), tagging sequences, secondary structure sequences, sequences for attachment/hybridization of label probes, functionalization sequences, primer binding sites, recognition sites for nucleases, such as nicking enzymes, restriction endonucleases, and the like.

In one embodiment, the adaptors comprise a restriction endonuclease recognition site as known in the art. In one embodiment, such recognition sites can be for nicking enzymes.

In one embodiment, the restriction endonuclease site is a Type IIs restriction endonuclease site. Type-IIs endonucleases are generally commercially available and are well known in the art. Like their Type-II counterparts, Type-IIs endonucleases recognize specific sequences of nucleotide base pairs within a double stranded polynucleotide sequence. Upon recognizing that sequence, the endonuclease will cleave the polynucleotide sequence, generally leaving an overhang of one strand of the sequence, or “sticky end.” Type-IIs endonucleases also generally cleave outside of their recognition sites; the distance may be anywhere from 2 to 20 nucleotides away from the recognition site. Because the cleavage occurs within an ambiguous portion of the polynucleotide sequence, it permits the capturing of the ambiguous sequence up to the cleavage site, under the methods of the present invention. Usually, type IIs restriction endonucleases are selected that have cleavage sites separated from their recognition sites by at least six nucleotides (i.e. the number of nucleotides between the end of the recognition site and the closest cleavage point). Exemplary type IIs restriction endonucleases include, but are not limited to, Eco57M I, Mme I, Acu I, Bpm I, BceA I, Bbv I, BciV I, BpuE I, BseM II, BseR I, Bsg I, BsmF I, BtgZ I, Eci I, EcoP15 I, Fok I, Hga I, Hph I, Mbo II, Mnl I, SfaN I, TspDT I, TspDW I, Taq II, and the like.
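To make the offset between recognition site and cleavage site concrete, the sketch below locates a recognition site on the top strand and reports where a type IIs enzyme would cut. It is an illustration under stated assumptions: Fok I is used with its commonly catalogued specificity GGATG(9/13), the input sequence is invented, the function name is hypothetical, and the code handles only a non-degenerate site on the forward strand.

```python
def type_iis_cut_sites(seq, recognition, top_offset, bottom_offset):
    """Return (top_cut, bottom_cut) for the first forward-strand site, else None.

    Each cut index i means the strand is cleaved between seq[i - 1] and seq[i],
    using top-strand coordinates for both strands. Offsets are counted from the
    last base of the recognition site.
    """
    start = seq.find(recognition)
    if start < 0:
        return None
    site_end = start + len(recognition)
    return site_end + top_offset, site_end + bottom_offset


# Invented example: an adaptor ending in the Fok I site, followed by target sequence.
sequence = "AAGGATG" + "ACGTACGTACGTACGTACGT"
top_cut, bottom_cut = type_iis_cut_sites(sequence, "GGATG", 9, 13)

site_end = sequence.find("GGATG") + len("GGATG")
captured = sequence[site_end:top_cut]      # target bases retained next to the adaptor
overhang_length = abs(bottom_cut - top_cut)

print(top_cut, bottom_cut, captured, overhang_length)
```

Fok I is chosen here only because its offsets are widely documented; it leaves a 5′ overhang, so an enzyme with a longer reach or a 3′ protruding end may be a better match for a given embodiment.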

In some embodiments, each adaptor comprises the same Type IIs restriction endonuclease site. In alternative embodiments, different adaptors comprise different sites.

In one embodiment, one or more of the adaptors comprise anchor probe hybridization sites. As is outlined below, anchor probes are used in sequencing reactions, and can take a variety of forms. In general, at least one end of the anchor probe hybridization site is at the junction between the target sequence and the adaptor; that is, sequencing reactions generally rely on hybridization of the anchor probe directly adjacent to detection positions of the target sequence. The anchor or primer may be selected or designed to be or to have one to about ten or more, preferably one to four bases, shifted left or right from the target-adaptor junction. As used herein, “detection position” refers to a position in a target sequence for which sequence information is desired.

In many embodiments, sequencing reactions can be run off both ends of the anchor probes; thus, in some embodiments, the anchor probe hybridization site comprises the entire adaptor sequence. Alternatively, there may be two anchor probe hybridization sites within each adaptor; one adjacent or close to the 3′ end of the target sequence and one adjacent or close to the 5′ end. As will be appreciated by those in the art, depending on the length of the anchor probes and the length of the adaptor, two anchor probe hybridization sites may overlap within the adaptor, they may be directly adjacent, or they may be separated by intervening sequences. The length of the anchor probe hybridization sequence will vary depending on the conditions of the assay.

In one embodiment, one or more of the adaptors comprise a primer binding sequence. As is known in the art, polymerases generally require a single stranded template (the concatemers, for example) with a portion of double stranded nucleic acid. Essentially, any sequence can serve as a primer binding sequence, to bind a primer, as any double stranded sequence will be recognized by the polymerase. In general, the primer binding sequence is from about 3 to about 30 nucleotides in length, with from about 15 to about 25 being preferred. Primer oligonucleotides are usually 6 to 25 bases in length. As will be appreciated by those in the art, the primer binding sequence can be contained within any of the other adaptor sequences.

In one embodiment, one or more of the adaptors comprise a capture probe recognition sequence. As is more fully outlined below, one embodiment of the invention utilizes capture probes on the surface of a substrate to immobilize the DNBs. In this embodiment, the adaptors comprise a domain sufficiently complementary to one or more capture probes to allow hybridization of the domain and the capture probe, resulting in immobilization of the DNBs on the surface.

In one embodiment, one or more of the adaptors comprise a secondary structure sequence. For example, palindromic sequences in a plurality of adaptors within the concatemer result in hybridization between adaptors (e.g. intramolecular interactions between copies in the concatemer), thus “tightening” the three dimensional structure of the DNA nanoball (“DNB”). These palindromic sequence units can be 5, 6, 7, 8, 9, 10 or more nucleotides in length and of various sequences, such as sequences chosen to provide a specific melting temperature. For example, a palindrome AAAAAAATTTTTTT (SEQ. ID NO. 8) will provide a 14-base dsDNA hybrid between any two neighboring unit replicas in the form of:

AAAAAAATTTTTTT (SEQ. ID NO.8) TTTTTTTAAAAAAA (SEQ. ID NO.9)
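A sequence is palindromic in this sense when it equals its own reverse complement, which is why two copies of it in neighboring monomers can pair with each other. The short check below illustrates that property for the example sequence; the function names are hypothetical.

```python
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    """Return the reverse complement of an uppercase DNA sequence."""
    return seq.translate(COMPLEMENT)[::-1]


def is_self_complementary(seq):
    """True if the sequence can base-pair with another copy of itself."""
    return seq == reverse_complement(seq)


palindrome = "AAAAAAATTTTTTT"
print(reverse_complement(palindrome), is_self_complementary(palindrome))
```

The same test could be applied when choosing other palindromic units, for example ones with higher GC content to raise the melting temperature.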

In one embodiment, the adaptors comprise label probe binding sequences. In some embodiments, for example for detection of particular sequences rather than sequencing reactions, label probes can be added to the concatemers to detect particular sequences. Label probes will hybridize to the label probe binding sequence and comprise at least one detectable label, as is outlined herein. For example, detection of the presence of infectious agents such as bacteria or viruses can be done in this manner.

In one embodiment, the adaptors comprise tagging sequences. In this embodiment, tagging sequences may be used to pull out or purify circularized target sequences, concatemers, etc. In some embodiments, tagging sequences may include unique nucleic acid sequences that can be utilized to identify the origin of target sequences in mixtures of tagged samples, or can include components of ligand binding pairs, such as biotin/streptavidin, etc.

In one aspect, interspersed adaptors each have a length in the range of from 8 to 60 nucleotides; in another aspect, they have a length in the range of from 8 to 32 nucleotides; in another aspect, they have a length in a range selected from about 4 to about 400 nucleotides; from about 10 to about 100 nucleotides, from about 400 to about 4000 nucleotides, from about 10 to about 80 nucleotides, from about 20 to about 70 nucleotides, from about 30 to about 60 nucleotides, and from about 4 to about 10 nucleotides. Embodiments utilizing adaptors with a total length from about 20 to about 30 bases find particular use in several embodiments.

The number of interspersed adaptors inserted into target polynucleotides may vary widely and depends on a number of factors, including the sequencing/genotyping chemistry being used (and its read-length capacity), the particular length of the cleavage site of a particular Type IIs site, the number of nucleotides desired to be identified within each target polynucleotide, whether amplification steps are employed between insertions, and the like.

In one aspect, a plurality of interspersed adaptors are inserted at sites in a contiguous segment of a target polynucleotide; this may include two, three, four or more interspersed adaptors that are inserted at sites in a contiguous segment of a target polynucleotide. Alternatively, the number of interspersed adaptors inserted into a target polynucleotide ranges from 2 to 10; from 2 to 4; from 3 to 6; from 3 to 4; and from 4 to 6. In another aspect, interspersed adaptors may be inserted in one or both polynucleotide segments of a longer polynucleotide, e.g., 0.4-4 Kb in length, that have been ligated together directly or indirectly in a circularization operation (referred to herein as a “mate-pair”). In one aspect, such polynucleotide segments may be 4-400 (preferably 10-100) bases long.

It should also be noted that in general, the first adaptor attached to a target sequence is not “interspersed” or “inserted”. That is, the first adaptor is generally attached to one terminus of the fragmented target sequence, and the subsequent adaptors are interspersed within a contiguous target sequence.

In one aspect, each member of a group of target polynucleotides has an adaptor with an identical anchor probe binding site and type IIs recognition site attached to a DNA fragment from source nucleic acid. In another embodiment, classes of polynucleotides may be created by providing adaptors having different anchor probe binding sites.

In one aspect, adaptors are inserted at intervals within a contiguous region of a target polynucleotide in which the intervals have pre-determined lengths. These pre-determined lengths may or may not be equal. In some embodiments the lengths of the intervals are known to an accuracy of about 1 to 200 nucleotides, in other embodiments from about 1-15, 10-40 and 100-200 nucleotides.

Interspersed adaptors may in accordance with the invention be single or double stranded.

In one aspect, adaptors include palindromic sequences, which foster intramolecular interactions within the target polynucleotide, resulting in a “nano-ball”.

Methods for Inserting a Plurality of Adaptors

One aspect of the invention provides a method for producing a target polynucleotide having interspersed adaptors, as illustrated diagrammatically in FIGS. (1A-1B). In this method, target polynucleotide (1002) is combined with adaptor (1000), which may or may not be an interspersed adaptor, to form (1004) circle (1005), which may be either single stranded or double stranded. The target polynucleotide is generally obtained by fragmentation of a larger piece of DNA, such as chromosomal or other genomic DNA.

If double stranded DNA is used, then the ends of the fragments may be prepared for circularization by “polishing” and optional ligation of adaptors using conventional techniques, such as employed in conventional shotgun sequencing, e.g. Bankier, Methods Mol. Biol., 167: 89-100 (2001); Roe, Methods Mol. Biol., 255: 171-185 (2004); and the like.

In order to generate the next site for inserting a second interspersed adaptor, circle (1005) is typically rendered double stranded, at least temporarily. Adaptor (1000) is designed in this aspect of the invention to include a recognition site of a type IIs restriction endonuclease, which is oriented so that its cleavage site (1006) is interior to the target polynucleotide, shown, for example, to the right of adaptor (1000), thereby opening (1008) circle (1005). In a preferred embodiment, the method of inserting interspersed adaptors employs type IIs restriction endonucleases that leave 3′ protruding strands after cleavage. For less precise insertion, a nicking enzyme may be used, or one strand of the first adaptor may be disabled from ligation, thus creating a nick that can be translated at an approximate distance and used to initiate polynucleotide cutting.

After the polynucleotide is cleaved, interspersed adaptor (1010) is ligated into place using conventional techniques to produce open circle (1012) containing two adaptors, which is then closed (1016) by ligation. The process is then repeated (1018): cleaving, inserting, and closing, until a desired number of interspersed adaptors, such as three, are inserted (1026) into target polynucleotide (1002), as shown in FIG. 1B. The final circle (1024) containing the interspersed adaptors may then be processed in a number of ways to obtain sequence information at sites in the target polynucleotide adjacent to at least one boundary of each interspersed adaptor.
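The cleave, insert, and close cycle can be modeled on strings to show where each adaptor ends up. This is a simplified sketch, not the laboratory procedure: the circle is written as a linear string opened at the 5′ end of the first adaptor, each cut is assumed to fall a fixed number of bases (the hypothetical parameter `reach`) downstream of the most recently inserted adaptor, and overhangs, strandedness, and ligation chemistry are ignored.

```python
def insert_adaptors(target, adaptors, reach):
    """Model repeated type IIs cutting and adaptor insertion on a circle.

    target:   target polynucleotide sequence
    adaptors: adaptor sequences in insertion order; the first is the initial
              (terminal) adaptor, the rest are interspersed
    reach:    assumed distance from the end of an adaptor to the cut site
    """
    circle = adaptors[0] + target            # circle linearized at the first adaptor
    cut = len(adaptors[0]) + reach           # first cut, inside the target
    for adaptor in adaptors[1:]:
        if cut > len(circle):
            raise ValueError("cut site falls outside the target sequence")
        circle = circle[:cut] + adaptor + circle[cut:]   # open, insert, re-close
        cut += len(adaptor) + reach          # next cut is measured from the new adaptor
    return circle


target = "acgtacgtacgtacgtacgtacgtacgtac"   # invented 30-base target
final_circle = insert_adaptors(target, ["[A1]", "[A2]", "[A3]"], reach=10)
print(final_circle)
```

In this toy run the three adaptors end up separated by 10-base target segments, which illustrates how the enzyme's reach sets the spacing between interspersed adaptors.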

Typically, sequences of a target polynucleotide are analyzed at or adjacent to one or both of the boundaries (e.g. 1021) between each interspersed adaptor and the target polynucleotide. In one aspect, final circle (1024), or a segment of it, may be amplified to generate an amplicon that is analyzed by a selected sequencing chemistry, such as one based on ligation or sequencing-by-synthesis. In one aspect, the first and last interspersed adaptors may be selected so that the region of final circle (1024) containing the interspersed adaptors can be cleaved (1038) from the circle, after which adaptors are ligated (1040) for amplification by polymerase chain reaction (PCR). Cleavage of the circle may be performed on one or two sites outside of adaptors 1 and 3. In another aspect, final circle (1024) may be used directly to generate amplicons by rolling circle replication (RCR), as described more fully below.

For applications in which many different target polynucleotides are analyzed in parallel, target polynucleotides having interspersed adaptors may be amplified using RCR or emulsion PCR as shown in FIGS. (1C-1D) and FIGS. (1E-1G), respectively.

In emulsion PCR, a mixture of fragments may be amplified, e.g. as disclosed by Margulies et al, Nature, 437: 376-380 (2005); Shendure et al (2005), Science, 309: 1728-1732; Berka et al, U.S. patent publication 2005/0079510; Church et al, PCT publication WO 2005/082098; Nobile et al, U.S. patent publication 2005/0227264; Griffiths et al, U.S. Pat. No. 6,489,103; Tillett et al, PCT publication WO 03/106678; Kojima et al, Nucleic Acids Research, 33 (17): e150 (2005); Dressman et al, Proc. Natl. Acad. Sci., 100: 8817-8822 (2003); Mitra et al, Anal. Biochem., 320: 55-65 (2003); Musyanovych et al, Biomacromolecules, 6: 1824-1828 (2005); Li et al, Nature Methods, 3: 95-97 (2006); and the like, which are incorporated herein by reference in their entirety for all purposes.

Briefly, as illustrated in FIG. (1E), after isolation of DNA circles (1500) comprising target polynucleotides with interspersed adaptors, the adaptors are excised, e.g. as shown in FIG. 1A (1038), to form a population of excised sequences, which are then ligated to adaptors (1503). The adaptored sequences are combined in a water-oil emulsion (1505) with primers specific for an adaptor ligated to one end of excised sequences, beads having attached primers specific for an adaptor ligated to the other end of excised sequences, and a DNA polymerase. Conditions are selected that permit a substantial number (e.g. greater than 15-20 percent) of aqueous bubbles (1508) in oil (1506) to contain a single adaptored sequence (1510) and at least one bead (1512). The aqueous phase in bubbles (1508) otherwise contains a conventional reaction mixture for conducting PCR, which results in beads (1518) each having a clonal population of a distinct adaptored sequence attached.
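One way to reason about the "greater than 15-20 percent" condition is to model droplet loading statistically. The sketch below is an assumption layered on the text, which does not specify a loading model: templates and beads are taken to be distributed independently among droplets following Poisson statistics, and the mean occupancies used are hypothetical.

```python
import math


def usable_droplet_fraction(mean_templates, mean_beads):
    """Fraction of droplets with exactly one template and at least one bead.

    Assumes independent Poisson loading of templates and beads (an
    illustrative model, not one stated in the text).
    """
    p_single_template = mean_templates * math.exp(-mean_templates)
    p_at_least_one_bead = 1.0 - math.exp(-mean_beads)
    return p_single_template * p_at_least_one_bead


# Hypothetical loading: on average one template and one bead per droplet.
fraction = usable_droplet_fraction(mean_templates=1.0, mean_beads=1.0)
print(round(fraction, 3))
```

Under this model the single-template probability peaks at a mean of one template per droplet, so a usable fraction of roughly 0.23 at the settings shown is in line with the threshold quoted above.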

In one aspect of the invention, the introduction of multiple interspersed adaptors into a single genomic fragment proceeds through a series of steps involving 1) ligation of an initial adaptor harboring a binding site for a type IIs restriction enzyme and closing the DNA circle, followed by 2) primer extension and selective restriction cutting of the genomic sequence to reopen the circle; and 3) ligation of a second adaptor and closing the DNA circle. Steps 2 and 3 are then repeated to incorporate a third adaptor into the genomic sequence (FIGS. 2B and 2C). The second adaptor may utilize the same restriction site as the first adaptor to minimize cutting genomic segments at an internal site of the genomic DNA. In one embodiment, controlled cleavage using the recognition site of the second adaptor and not of the first adaptor is accomplished by blocking the cleavage at the first adaptor restriction site using techniques known in the art, such as by methylating the first restriction site prior to cutting at the second site.

Adaptors with different binding sites may be used with two aliquots of a sample to prevent exclusion of certain genomic fragments. In one embodiment, a part of the sequence of the final adaptor is used as an RCR priming site and another part of the adaptor is used as a binding site for an anchor oligonucleotide attached to a glass surface.

In one aspect of the invention, a method for inserting adaptors into a genomic fragment begins with ligation of a first adaptor followed by circle formation. Genomic fragments of 100 to 300 (or 300-600) bases in length may be prepared by DNase fragmentation that generates 5-prime phosphates and 3-prime OH groups suitable for ligation. High-complexity genomic DNA can be prepared as single stranded (ss) DNA by heating (denaturation) and rapid cooling. Since the DNA is of high complexity, the localized concentration of the complementary sequence for any fragment may be negligible, thus allowing sufficient time to perform subsequent procedures when the DNA is mostly in the single stranded state. The use of ssDNA significantly simplifies circle formation because of the distinct polarity of 5′ and 3′ ends of each ssDNA fragment. The first stage is ligation of adaptor sequences to the ends of each single stranded genomic fragment. Since all possible sequence combinations may be represented in the genomic DNA, an adaptor can be ligated to one end with the aid of a bridging template molecule that is synthesized with all possible sequences (FIG. 2B). Since these oligonucleotides may be of relatively high concentration compared to the genomic DNA, the oligonucleotide that is complementary to the end of the genomic fragment (or a complement with mismatches) may hybridize. A bridge is thus formed at the ligation site to allow ligation of the 5-prime end of the single stranded genomic fragment to the adaptor. In one embodiment, this structural arrangement does not allow ligation of the adaptor to the 3-prime end of the fragment.

In FIG. 2B, another exemplary method is illustrated for incorporating multiple interspersed adaptors into DNA circles. Such method comprises the steps of: 1. Ligation of adaptors (230) to the 5′ and 3′ ends of single stranded DNA (232) (the adaptors having degenerate (6-9 bases) bridge templates (234)) followed by ligation of the adaptors via 3-base overhangs (236); 2. Extension (238) from the adaptor oligonucleotide with a polymerase to create double stranded DNA for type IIs restriction enzyme cutting; 3. A cut (242) at 12-16 bases downstream of the type IIs recognition site (240) opens the circle; 4. Heating results in loss of new strands (243); and 5. The fragment is ready for introduction of another adaptor (230) and closing the circle again.

Capture of the 3′ end into the circle requires the use of an oligonucleotide template that again is prepared with degenerate bases so that a bridge structure is formed over the ligation site. The second adaptor section at the 3′ end of the genomic fragment is used to close the circle with a 3-base overhang that is complementary to the end of the adaptor that bound at the 5′ end. By performing the attachment of this adaptor segment at a temperature that favors hybridization of the template bridge (but not the 3 base overhang), the excess bridge molecule can be removed by buffer exchange since the genomic/adaptor molecule is attached to a solid support. A 3-base overhang is sufficient for circle formation but would not be favored until the temperature was decreased. The use of two bridging oligonucleotides with degenerate bases can eliminate artifacts created by the diverse sequence ends of the genomic DNA. In a preferred embodiment, both bridging oligonucleotides attach independently of each other to ensure freedom of the degenerate oligonucleotides to bind to their complementary sequences. Both of the adaptor components may be ligated to the respective DNA ends in the same ligation reaction and ligation artifacts can be further prevented by designing bridging template oligonucleotides with blocked ends.

The incorporation of a capture mechanism such as biotin/streptavidin onto the non-circle adaptor strand can be used in downstream cleanup processes. In such an embodiment, since both unligated and ligated biotinylated adaptors are present, the unligated excess adaptor can be removed by size selection of adaptor-genomic fragments that are ˜200 bases in length. The adaptor-genomic fragments can then be attached to streptavidin coated beads for subsequent cleaning procedures. Another option is to use beads with a capture oligonucleotide (possibly incorporating PNA or LNA) complementary to a portion of one ligated adaptor. Beads with a pre-assembled left side of the first adaptor/template may be used to further simplify the process.

In FIG. 2C, another exemplary method for incorporating interspersed adaptors is illustrated. The method comprises the following steps: (1) Ligate two adaptor segments (250 and 252) to single stranded DNA fragments (254) using template oligonucleotides (the double stranded segment of 250 may be about 10 bases long, and the double stranded segment of 252 may be 8-10 bases long) containing degenerate bases (for example, segments 256 and 258 show the use of 7 degenerate bases, but 8 degenerate bases could also be used). Both ends of template oligonucleotides (250 and 256) are blocked from ligation with dideoxy termination on the 3′ ends and either OH-group or biotin on the 5′ ends. The adaptor/template hybrids are used at very high concentrations such as 1 μM and are in 1000-fold excess over genomic DNA. (2) DNA is collected on streptavidin support (260) via the biotin on the 5′ end of the 3′ adaptor (250). Excess free 5′ adaptors are removed with the supernatant. (3) DNA is released from the streptavidin support by elevated temperature and the supernatant is collected. (4) DNA is recaptured to a solid support using a long capture oligonucleotide (262) with 3′ end blocked by dideoxy termination. The oligonucleotide may be in the form of a peptide nucleic acid (PNA) to provide tight binding of the DNA to the solid support to facilitate removal of excess free adaptors in subsequent procedures. Capture oligonucleotide (262) can be extended by addition of 1-10 degenerate bases at the 5′ end (264) for binding the genomic portion to increase stability. (5) The bridge template (266, which may be 14-18 bases long) is used to bring the two ends of the adaptors together to circularize the DNA molecule. It will be blocked on the 5′ end with an amide group, but the 3′-OH group will be available for subsequent elongation by DNA polymerase in later steps. Kinase and ligase are provided in the reaction to phosphorylate the 5′ end of the 5′ adaptor and to ligate the two ends of the DNA molecule.
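The reason the adaptor/template hybrids are supplied in large excess follows from the size of the degenerate pool. The sketch below is a back-of-the-envelope illustration under one assumption not stated in the text: each degenerate position is an equimolar mix of A, C, G, and T, so a template with n degenerate bases is a pool of 4^n sequences. The function name is hypothetical.

```python
def degenerate_pool_size(degenerate_bases):
    """Number of distinct sequences in a template with fully degenerate positions."""
    return 4 ** degenerate_bases


# Pool sizes for the 6-9 degenerate bases mentioned for the bridge templates.
pool_sizes = {n: degenerate_pool_size(n) for n in range(6, 10)}

# Concentration of any one specific sequence when the whole pool is at 1 uM
# and the template carries 7 degenerate bases (equimolar mix assumed).
total_concentration_molar = 1e-6
per_sequence_molar = total_concentration_molar / degenerate_pool_size(7)

print(pool_sizes, per_sequence_molar)
```

With 7 degenerate bases the pool holds 16,384 sequences, so at 1 μM total each exact-match template is present at only about 61 pM, which helps explain why high overall template concentrations are specified.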

In another exemplary capture procedure for inserting multiple adaptors, two adaptor segments are ligated to genomic ssDNA fragments using degenerate templates (FIG. 2C). The 3′ end of the adaptor segment that ligates to the 5′ end of the genomic DNA has a blocking complement. The template for the 3′ adaptor segment has biotin. Adaptor/templates are at a very high concentration, such as 1 μM, which is ˜1000× higher than the concentration of genomic DNA. DNA is collected on a streptavidin support and the solution is removed with the excess of adaptor components. The genomic DNA is released at an elevated temperature and the DNA solution is collected. The DNA is collected again on a second solid support with a long oligonucleotide (with blocked ends) complementary to the 5′ end adaptor segment with removal of all other synthetic DNA. A bridging template is then added that serves also as a primer. Kinase and ligase (and polymerase) are added to close the circle and extend the primer to about 30 bases. Extension is controlled by time or by presence of ddNTPs. The enzymes are heat inactivated and the DNA is then cut with a type IIs restriction enzyme. The short double stranded portions are removed at elevated temperature with the circle attached to the solid support via a strong hybrid to the attached oligonucleotide. This stronger hybrid is maintained by incorporating LNA or PNA bases into the oligonucleotide. Two adaptor segments with templates for the second adaptor are then added (same design as above); no additional solid support attachment is required since the circle DNA will be continually associated with the solid support for further steps. Elevated temperatures are used to remove templates bound to the circular DNA. This step is repeated to insert a third adaptor. If no additional adaptors are to be inserted, then no polymerase is added and after a buffer exchange the DNA is released at elevated temperatures for the RCR reaction.

Another exemplary method of inserting interspersed adaptors is illustrated in FIG. 2D. This method generates segments of target polynucleotide with predetermined lengths adjacent to interspersed adaptors. The predetermined lengths are selected by selecting and positioning type IIs restriction endonucleases within the interspersed adaptors. In one aspect of this method, each different interspersed adaptor from the initial adaptor to the penultimate adaptor has a recognition site of a different type IIs restriction endonuclease. Double stranded DNA (dsDNA) is fragmented to produce target polynucleotides (270) having frayed ends (269), after which such ends are repaired using conventional techniques to form fragments (271) with blunt ends. To the 3′ ends of blunt end fragments (271) a single nucleotide (273) is added, e.g. dA, using Taq polymerase, or like enzyme, to produce augmented fragments (272). Augmented fragments (272) are combined with interspersed adaptors (274) that have complementary nucleotide overhangs, e.g. dT, in the presence of a ligase so that multiple ligation products form, including product (275) that comprises a single interspersed adaptor and a single fragment. Conditions can be adjusted to promote the circularization (276) of product (275) so that dsDNA circles (283) are formed. Other products, such as conjugates with interspersed adaptors at both ends or unligated fragments and adaptors, will not generally have the ability to form circles and can be removed through digestion with a single stranded exonuclease after circularization of product (275).

dsDNA circles (283) are treated with a type IIs restriction endonuclease recognizing a site in adaptor (278) to cleave dsDNA circles (283) to leave segment (277) of target polynucleotide (270) adjacent to adaptor (278). In this embodiment, cleavage by the type IIs restriction endonuclease leaves 3′ indented ends that are extended by a DNA polymerase to form blunt ends (279), after which fragment (284) is treated to add a single nucleotide to its 3′ ends, as above. To fragment (284), a second interspersed adaptor (281) having complementary overhangs is ligated, and the process repeated to incorporate additional interspersed adaptors. In one embodiment, each cycle of interspersed adaptor incorporation includes an amplification step of the desired product to generate sufficient material for subsequent processing steps.

In FIG. 2E, another exemplary method is illustrated for incorporating interspersed adaptors at predetermined sites in a target polynucleotide. Fragments are generated as in FIG. 2D and dsDNA circles (285) are produced that have an initial interspersed adaptor (286) containing a type IIs recognition site, as described above, that cleaves dsDNA circle (285) at a predetermined site (287) to give fragment (288) having 3′ overhangs (289), which may have lengths different than two. Interspersed adaptor of fragment (288) either contains a nick (290) at the boundary of the adaptor and the fragment or it contains the recognition site for a nicking endonuclease that permits the introduction of a nick (291) at the interior of the adaptor. In either case, fragment (288) is treated with a DNA polymerase (292) that can extend the upper strand from a nick (e.g. 291) to the end of the lower strand of fragment (288) to form a fragment having a 3′ overhang at one end and a blunt end at the other. To this fragment is ligated an interspersed adaptor (294) that has a degenerate nucleotide overhang at one end and a single 3′ nucleotide (e.g. dT) overhang at the other end to form fragment (295), which is treated (e.g. with Taq polymerase) to add a 3′ dA to its blunt end, forming fragment (296). Fragment (296) is then circularized by ligation at site (297) to form dsDNA circle (298) and other ligation products are digested, as described above. Additional cycles of this process may be carried out to incorporate additional interspersed adaptors, and as above, optional steps of amplification may be added in each cycle, or as needed.

In FIG. 2F, another method of incorporating interspersed adaptors is illustrated that provides segments of variable lengths between interspersed adaptors. That is, interspersed adaptors are incorporated in a predetermined order, but at spacings that are not precisely known. This method allows incorporation of adaptors at distances longer than those provided by known restriction enzymes. As above, dsDNA circles (2000) are prepared having an initial adaptor (2002) (that may or may not be an interspersed adaptor) containing a recognition site (2004) for a nicking enzyme. After creation of nick (2006), dsDNA circle (2000) is treated with a DNA polymerase (2008) that extends (2010) the free 3′ strand and displaces or degrades the strand with the free 5′ end at nick site (2006). The reaction is stopped after a predetermined interval, which is selected to be shorter than the expected time to synthesize more than a few hundred bases. Such extension may be halted by a variety of methods, including changing reaction conditions such as temperature, salt concentration, or the like, to disable the polymerase being used. This leaves dsDNA circle with a nick or other gap (2012), which is recognized and cleaved by a variety of enzymes having nuclease activities, such as DNA polymerases, FEN-1 endonucleases, S1 nuclease (2014), and the like, which may be used alone or in combination, e.g. Lieber, BioEssays, 19: 233-340 (1997). After cleavage at nick or gap (2012), the ends of the target polynucleotide may be repaired using techniques employed in shotgun sequencing, after which target polynucleotide (2000) may be cleaved (2017) to the left of adaptor (2002) using a type IIs restriction endonuclease that leaves a staggered, or sticky, end. To the blunt end, the next interspersed adaptor is attached, after which the resulting construct may be circularized using conventional techniques for further insertions of interspersed adaptors. In one embodiment, the distances between successive interspersed adaptors, e.g. (2002) and (2018), are not known precisely and depend on the cleaving enzyme employed, the polymerase employed, the time interval allowed for synthesis, the method of stopping synthesis, reaction conditions, such as dNTP concentrations, and the like.

In one embodiment, at step (2010), nick translation can be used instead of strand displacement. In one aspect, in the polynucleotide break (2016) a second adaptor may be ligated only to the side connected to the first adaptor. This method can be combined with a second cut on the opposite side of the adaptor (2006) to create a mate-pair structure with various lengths of two segments such as (10-50)+(30-300) bases.

In one aspect, the invention provides a method for inserting adaptors using CircLigase™ to close single stranded polynucleotide circles without template. This enzyme provides the ability to use adaptors as single oligonucleotides and to use only one template. In this method, after an adaptor is ligated to the 5′ end of the target polynucleotide using a standard ligase such as T4 DNA ligase, the excess adaptor and template are removed. CircLigase™ (and kinase if the adaptor is not phosphorylated at the 5′ end) can then be used to close single stranded polynucleotide circles.

In one embodiment, after the initial adaptor is inserted into the polynucleotide, it may need to be released from the support to be able to form a single stranded circle. The polynucleotide can then be re-hybridized to the support; in one embodiment, this re-hybridization occurs on a capture oligonucleotide which is bound to the surface of the support. A primer is added together with polymerase after closing the circle for generating local dsDNA and allowing the cutting with type IIS restriction enzymes:

      |-NNNNNNNUUUUUUUUUUU-|GGGGGGGGGGGGGG.UUUUUUUUUUUUUUUUUUUUUUUUUUU-5′OH 3′OH-GGGGGGGGGG . . .

Ligation of multiple adaptors may be prevented by starting with 5′OH or by having a long blocking template, possibly in the form of a hairpin:

|-NNNNNNNUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUU- SolidUUUUUUUUUUUUUUUUUUUUUUUUUUUU-P|-UUUUUUU-|

where U=common base, N=degenerate base, P=phosphate, G=genomic or DNA of interest.

Once circle formation has occurred, a primer already pre-hybridized to the adaptor is extended with a polymerase to create enough double stranded DNA for type IIs restriction enzyme cutting, allowing precise insertion of additional adaptors (FIG. 9). A polymerase such as Klenow may be used, along with a level of ddNTPs to control extension length to about 20-30 bases.

Inserting two additional adaptors can in some embodiments of the invention take 2-3 hours if each enzymatic step is accomplished in less than 30 minutes. Sporadic errors created in the adaptor insertion process can be tolerated because of the redundant tens of overlapping sequences generated for each base and because of probe-probe data that is generated on more than 100 bases of each DNA fragment that is not subjected to adaptor insertion.

In one exemplary method, multiple adaptors can be inserted by preparing dsDNA circles with a 50-100 bases+25 base mate-pair at >1 Kb distance. In this method, a dsDNA circle of a ˜1-3 Kb genomic fragment is provided with an adaptor using A/T or blunt-end ligation. In one embodiment, the adaptor has a nicking enzyme binding site or it has one Uracil or other cleavable or photo-cleavable base analogs or one 3′ end that is not ligated and recognition sites for two different IIS binding enzymes.

In one embodiment, the DNA is cut using a nicking enzyme or at Uracil sites and the available 3′ end is extended (or just extended if adaptor ligation has left a nick) by ˜75 bases with a strand-displacement enzyme or nick translation enzyme; in the case of using an unligated 3′ site, the displacement would be through the adaptor, e.g. the length would be 75 bases plus the length of the adaptor. The available 3′ end may be removed by nick translation or by DNA synthesis with strand displacement. The cut can be at a nick or at a branched structure by one of several enzymes including single stranded cutting enzymes. This process results in a dsDNA fragment 30-110 bases next to one end of the initial adaptor. The DNA can then be cut with a Type IIS restriction endonuclease that has a long cutting distance. In one embodiment, the cutting distance is from 18 to 25 bases. The circle can be closed without adaptor (blunt end ligation of genomic fragments) or by directional blunt end ligation of a second adaptor. Both adaptors may be used for further insertion of additional adaptors using different or the same enzymes. If the first adaptor site is methylated before insertion of the second adaptor, the second adaptor can use the same restriction site positioned at the proper distance from the adaptor end to obtain cutting at the specific position in the genomic DNA.

Methods of Circularization

Various standard DNA circle formation procedures may be used. One example is blunt end ligation of the adaptor. A problem with this approach is orientation and ligation of multiple incorporated adaptors. One strand of the cassette may have both the 5′ and 3′ ends blocked to ligation. Orientation of the cassette will determine which DNA strand will have a free 3′ end to initiate RCR. This will allow each strand to be replicated in about 50% of cases.

DDDDDDDDDXLLLLLLLLLLLLXDDDDDDDDDDD
DDDDDDDDDOLLLLLLLLLLLLODDDDDDDDDDD

DDDDDDDDDOLLLLLLLLLLLLODDDDDDDDDDD
DDDDDDDDDXLLLLLLLLLLLLXDDDDDDDDDDD

D = DNA, L = adaptor, X = blocked ligation site, O = open to ligation

As will be appreciated by those in the art, there are several ways to form circularized adaptor/target sequence components. In one embodiment, a CircLigase™ enzyme is used to close single stranded polynucleotide circles without template. Alternatively, a bridging template that is complementary to the two termini of the linear strand is used. In some embodiments, the addition of a first adaptor to one terminus of the target sequence is used to design a complementary part of the bridging template. The other end may be universal template DNA containing degenerate bases for binding to all genomic sequences. Hybridization of the two termini followed by ligation results in a circularized component. Alternatively, the 3′ end of the target molecule may be modified by addition of a poly-dA tail using terminal transferase. The modified target is then circularized using a bridging template complementary to the adaptor and to the oligo-dA tail.

In another embodiment, biotin is incorporated into each template oligonucleotide used to guide ligation. This allows for easy removal of templates, for example by applying high temperature melting, which removes the templates without removing formed circles. These longer oligonucleotides can serve as primers for RCR or be used for other purposes such as inserting additional cassettes.

In another embodiment, the target DNA may be attached to some solid support such as magnetic beads or tube/plate well walls to allow removal of all templates or adaptors that are not covalently ligated to the target DNA. Target ssDNA may be attached using a support with random primers to extend and create about 20-80 bases of dsDNA. The extension length may be controlled by time or by the amount of ddNTPs. Another approach is to ligate an adaptor to one end of the ssDNA and then size select DNA with the adaptor ligated to the ssDNA, at the same time removing free adaptor. In this case an anchor sequence about 10-50 bases in length complementary to part of the adaptor may be attached to the support to capture DNA and use it for subsequent steps. This anchor molecule may have additional components to increase hybrid stability, such as the incorporation of a peptide nucleic acid. Another method for attaching single stranded DNA is by utilizing a single stranded DNA binding protein attached to the support.

In one method of circularization, illustrated in FIG. 2A, after genomic DNA (200) is fragmented and denatured (202), single stranded DNA fragments (204) are first treated with a terminal transferase (206) to attach poly dA tails (208) to 3-prime ends. This is then followed by ligation (212) of the free ends intra-molecularly with the aid of bridging oligonucleotide (210) that is complementary to the poly dA tail at one end and complementary to any sequence at the other end by virtue of a segment of degenerate nucleotides. Duplex region (214) of bridging oligonucleotide (210) contains at least a primer binding site for RCR and, in some embodiments, sequences that provide complements to a capture oligonucleotide, which may be the same or different from the primer binding site sequence, or which may overlap the primer binding site sequence. The length of capture oligonucleotides may vary widely. In one aspect, capture oligonucleotides and their complements in a bridging oligonucleotide have lengths in the range of from 10 to 100 nucleotides; and more preferably, in the range of from 10 to 40 nucleotides. In some embodiments, duplex region (214) may contain additional elements, such as an oligonucleotide tag, for example, for identifying the source nucleic acid from which its associated DNA fragment came. That is, in some embodiments, circles or adaptor ligation or concatemers from different source nucleic acids may be prepared separately during which a bridging adaptor containing a unique tag is used, after which they are mixed for concatemer preparation or application to a surface to produce a random array. The associated fragments may be identified on such a random array by hybridizing a labeled tag complement to its corresponding tag sequences in the concatemers, or by sequencing the entire adaptor or the tag region of the adaptor. Circular products (218) may be conveniently isolated by a conventional purification column, digestion of non-circular DNA by one or more appropriate exonucleases, or both.

DNA fragments of the desired size range, e.g. 50-600 nucleotides, can be circularized using circularizing enzymes, such as CircLigase, a single stranded DNA ligase that circularizes single stranded DNA without the need of a template. A preferred protocol for forming single stranded DNA circles comprising a DNA fragment and one or more adaptors is to use a standard ligase, such as T4 ligase, for ligating an adaptor to one end of a DNA fragment followed by application of CircLigase to close the circle.

In an exemplary method, a DNA circle comprising an adaptor oligonucleotide and a target sequence is generated using T4 ligase. The method utilizes a target sequence that is a synthetic oligonucleotide T1N (sequence: 5′-NNNNNNNNGCATANCACGANGTCATNATCGTNCAAACGTCAGTCCANGAATCNAGATCCACTTAGANTGNCGNNNNNN-3′) (SEQ ID NO: 1). The adaptor is made up of 2 separate oligonucleotides. The adaptor oligonucleotide that joins to the 5′ end of T1N is BR2-ad (sequence: 5′-TATCATCTGGATGTTAGGAAGACAAAAGGAAGCTGAGGACATTAACGGAC-3′) (SEQ ID NO: 2) and the adaptor oligonucleotide that joins to the 3′ end of T1N is UR3-ext (sequence: 5′-ACCTTCAGACCAGAT-3′) (SEQ ID NO: 3).

UR3-ext contains a type IIs restriction enzyme site (Acu I: CTTCAG) to provide a way to linearize the DNA circle for insertion of a second adaptor. BR2-ad is annealed to BR2-temp (sequence 5′-NNNNNNNGTCCGTTAATGTCCTCAG-3′) (SEQ ID NO: 4) to form the double-stranded BR2 adaptor. UR3-ext is annealed to biotinylated UR3-temp (sequence 5′-[BIOTIN]ATCTGGTCTGAAGGTNNNNNNN-3′) (SEQ ID NO: 5) to form the double-stranded UR3 adaptor. 1 pmol of target T1N is ligated to 25 pmol of BR2 adaptor and 10 pmol of UR3 adaptor in a single ligation reaction (containing 50 mM Tris-Cl, pH 7.8, 10% PEG, 1 mM ATP, 50 mg/L BSA, 10 mM MgCl₂, 0.3 unit/μl T4 DNA ligase (Epicentre Biotechnologies, WI) and 10 mM DTT) in a final volume of 10 μl. The ligation reaction is incubated in a temperature cycling program of 15° C. for 11 min and 37° C. for 1 min, repeated 18 times. The reaction is terminated by heating at 70° C. for 10 min. Excess BR2 adaptors are removed by capturing the ligated products with streptavidin magnetic beads (New England Biolabs, MA). 3.3 μl of 4× binding buffer (2M NaCl, 80 mM Tris HCl pH 7.5) is added to the ligation reaction, which is then combined with 15 μg of streptavidin magnetic beads in a 1× binding buffer (0.5M NaCl, 20 mM Tris HCl pH 7.5). After a 15 minute incubation at room temperature, the beads are washed twice with 4 volumes of low salt buffer (0.15M NaCl, 20 mM Tris HCl pH 7.5). Elution buffer (10 mM Tris HCl pH 7.5) is pre-warmed to 70° C., 10 μl of which is added to the beads at 70° C. for 5 min. After magnetic separation, the supernatant is retained as primary purified sample. This sample can be further purified by removing the excess UR3 adaptors with magnetic beads pre-bound with a biotinylated oligonucleotide BR-rc-bio (sequence: 5′-[BIOTIN]CTTTTGTCTTCCTAACATCC-3′) (SEQ ID NO: 6) that is reverse complementary to BR2-ad, similarly as described above.

The concentration of the adaptor-target ligated product in the final purified sample can be estimated by urea polyacrylamide gel electrophoresis analysis. The circularization is carried out by phosphorylating the ligation products using 0.2 unit/μl T4 polynucleotide kinase (Epicentre Biotechnologies) in 1 mM ATP and standard buffer provided by the supplier, and circularizing them with a ten-fold molar excess of a splint oligonucleotide UR3-closing-88 (sequence 5′-AGATGATAATCTGGTC-3′) (SEQ ID NO: 7) using 0.3 unit/μl of T4 DNA ligase (Epicentre Biotechnologies) and 1 mM ATP. The circularized product is validated by performing RCR reactions.
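By way of illustration only, the base-pairing relationships among the oligonucleotides listed in this example can be checked computationally. The following Python sketch is not part of the described protocol; the function names are hypothetical, and it assumes that each template pairs with its adaptor oligonucleotide over the full non-degenerate portion of the template and that the splint spans the closing junction with equal arms on each side.

```python
# Oligonucleotide sequences copied from the exemplary method, written 5'->3'.
BR2_AD = "TATCATCTGGATGTTAGGAAGACAAAAGGAAGCTGAGGACATTAACGGAC"  # SEQ ID NO: 2
UR3_EXT = "ACCTTCAGACCAGAT"                                     # SEQ ID NO: 3
BR2_TEMP = "NNNNNNNGTCCGTTAATGTCCTCAG"                          # SEQ ID NO: 4
UR3_TEMP = "ATCTGGTCTGAAGGTNNNNNNN"                             # SEQ ID NO: 5 (biotin omitted)
BR_RC_BIO = "CTTTTGTCTTCCTAACATCC"                              # SEQ ID NO: 6 (biotin omitted)
SPLINT = "AGATGATAATCTGGTC"                                     # SEQ ID NO: 7
ACU_I_SITE = "CTTCAG"  # site string as written in the text

_COMPLEMENT = str.maketrans("ACGTN", "TGCAN")


def reverse_complement(seq):
    """Return the reverse complement of a DNA string (N pairs with N)."""
    return seq.translate(_COMPLEMENT)[::-1]


def check_design():
    """Return named boolean checks on the adaptor/template/splint design."""
    br2_arm = BR2_TEMP.lstrip("N")   # non-degenerate part of BR2-temp
    ur3_arm = UR3_TEMP.rstrip("N")   # non-degenerate part of UR3-temp
    half = len(SPLINT) // 2          # assumption: splint arms are equal
    # Closing junction of the circle: 3' end of UR3-ext joined to 5' end of BR2-ad.
    junction = UR3_EXT[-half:] + BR2_AD[:half]
    return {
        "acu_site_in_ur3_ext": ACU_I_SITE in UR3_EXT,
        "br2_temp_pairs_with_3prime_of_br2_ad":
            br2_arm == reverse_complement(BR2_AD[-len(br2_arm):]),
        "ur3_temp_pairs_with_ur3_ext":
            ur3_arm == reverse_complement(UR3_EXT),
        "splint_spans_closing_junction":
            SPLINT == reverse_complement(junction),
        "br_rc_bio_pairs_with_br2_ad":
            reverse_complement(BR_RC_BIO) in BR2_AD,
    }


if __name__ == "__main__":
    for name, ok in check_design().items():
        print(f"{name}: {ok}")
```

A check of this kind may be useful when redesigning adaptors, since a single base change in an adaptor oligonucleotide requires matching changes in its template, the splint, and the capture oligonucleotide.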

In another exemplary embodiment, which is illustrated in FIG. 2A, adaptor oligonucleotides (1604), are used to form (1608) a population (1608) of DNA circles by the method illustrated in FIG. 2A. In one aspect, each member of population (1608) has an adaptor with an identical anchor probe binding site and type IIs recognition site attached to a DNA fragment from source nucleic acid (1600). The adaptor also may have other functional elements including, but not limited to, tagging sequences, sequences for attachment to a solid surface, restriction sites, functionalization sequences, and the like. Classes of DNA circles may be created by providing adaptors having different anchor probe binding sites.

After DNA circles (FIG. (2A) 1608) are formed, further interspersed adaptors are inserted as illustrated generally in FIG. (2A) to form circles (1612) containing interspersed adaptors. To these circles, a primer and rolling circle replication (RCR) reagents can be added to generate (1614) in a conventional RCR reaction a population (1616) of concatemers (1617) of the complements of the adaptor oligonucleotide and DNA fragments. This population can then be isolated or otherwise processed (e.g. size selected) (1618) using conventional techniques, e.g. a conventional spin column, or the like, to form population (1620) for analysis.

To demonstrate that the formation of multiple-adaptor DNA circles is feasible, a synthetic target DNA of 70 bases in length and a PCR derived fragment of 200-300 bp in length may be obtained. A single stranded PCR fragment can be simply derived from a double stranded product by phosphorylation of one of the primers and treatment with lambda exonuclease to remove the phosphorylated strand. The single stranded fragment may be ligated to an adaptor for circularization. Polymerization, type IIs restriction enzyme digestion and re-ligation with a new adaptor may be performed as described herein.

Demonstration that the process was successful may proceed by RCR amplification of the final derived circles. Briefly, the DNA circles are incubated with primer complementary to the last introduced adaptor and phi29 polymerase for 1 hour at 30° C. to generate a single concatemer molecule comprising hundreds of repeated copies of the original DNA circle. Attachment of the RCR products to the surface of coverslips may proceed by utilizing an adaptor sequence in the concatemer that is complementary to an attached oligonucleotide on the surface. Hybridization of adaptor unique probes may be used to demonstrate that the individual adaptors were incorporated into the circle and ultimately the RCR product. To demonstrate that the adaptors were incorporated at the expected positions within the circle, sequence specific probes (labeled 5-mers) may be used for the synthetic or PCR derived sequence such that ligation may occur to an unlabeled anchor probe that recognizes the terminal sequence of the adaptor. Cloning and sequencing may also be used to verify DNA integrity.

In one embodiment, a template used for circle formation can also be used as a primer to create localized dsDNA. The schema is simplified by generating clean ssDNA after each circle cutting, which allows the use of the same circle closing chemistry for each adaptor incorporation.

In one embodiment, a solution of DNA fragments with sticky ends or blunt ends is prepared for making DNA circles. The traditional method to avoid making circles with more than one DNA molecule is to perform ligation in a large volume at a low concentration of DNA fragments where intermolecular ligation is unlikely.

In a preferred embodiment, the ligation reaction does not require a large volume. This embodiment involves a slow addition of aliquots of DNA fragments into a regular size ligation reaction. Fast mixing of the DNA aliquot and the reaction minimizes multi-mer formation. The DNA fragments can be prepared in a ligation mix without ligase or in water or TE-like buffer. Typically, the DNA volume is equal to or lower than the initial volume of ligation reaction. DNA may be in a large volume in water or simple buffer (such as TE buffer) if the ligation reaction evaporates with the speed of adding the DNA sample. The evaporation may be simplified by using a thermostable ligase.

In one embodiment, the method of circularization involves diluting a small aliquot of DNA into a regular ligation reaction (such as 0.1-0.5 μl in 10-50 μl, which provides over 100 fold dilution) and waiting for sufficient time to allow a majority of the DNA to form circles, followed by addition of a second aliquot. In another embodiment, DNA fragments are slowly and continuously added.
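The dilution achieved by adding a single aliquot can be estimated with simple arithmetic. The following Python sketch is illustrative only; the function name is hypothetical, and it assumes that the final volume is the sum of the reaction volume and the aliquot volume and that mixing is complete.

```python
def dilution_fold(aliquot_ul, reaction_ul):
    """Fold dilution of an aliquot added to a ligation reaction.

    Assumes additive volumes: final volume = reaction volume + aliquot volume.
    """
    if aliquot_ul <= 0 or reaction_ul <= 0:
        raise ValueError("volumes must be positive")
    return (reaction_ul + aliquot_ul) / aliquot_ul


if __name__ == "__main__":
    # Pair each aliquot volume with each reaction volume from the stated ranges.
    for aliquot in (0.1, 0.5):
        for reaction in (10.0, 50.0):
            fold = dilution_fold(aliquot, reaction)
            print(f"{aliquot} ul into {reaction} ul: about {fold:.0f}-fold")
```

Under these assumptions, a dilution of about 100 fold or more is obtained when the aliquot is no more than about 1% of the reaction volume, e.g. 0.1 μl into 10 μl or 0.5 μl into 50 μl; a 0.5 μl aliquot added to a 10 μl reaction is diluted only about 21 fold.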

Various physical implementations of the process are possible, such as manual or automated pipetting at a certain frequency, the use of drippers (gravity or positive pressure), piezo or acoustic spitting or nanodroppers, or cavro-pumps that can deliver drops as small as 30 nl. In one embodiment, 10 pmols in a 100 μl reaction having maximal temporal concentration of 1 pmol/μl is processed using a consecutive addition of 100 aliquots. In another embodiment, 10 pmols are in 30-50 μl aliquots. The time to circularize >70-80% of DNA fragments in one aliquot depends on ligase concentration, type of ends (sticky 1, 2, or 4 bases or blunt) and to some extent temperature (movements and hybrid stability of sticky ends). In a preferred embodiment, the total time of the reaction is approximately 4-16 hours.

In one embodiment, a ligase enzyme is immobilized on a solid support, such as beads. DNA fragments are then diffused into the ligation reaction from a gel block or other porous container using methods known in the art. To prevent ligation between fragments (rather than circularization), methods known in the art for temporarily blocking the DNA may be used, including but not limited to the use of non-ligatable DNA with matching sticky ends or ssDNA end binding proteins.

To increase the efficiency of flow-through of a small reaction volume, in one embodiment the reaction volume is dispensed under non-evaporating conditions, for example by using small droplets. Non-evaporating conditions can also be established by regulating humidity, temperature of the support ambient, and through design of the composition of reaction buffer. In an exemplary embodiment, 10 pl drops are dispensed by piezo spitting (˜20×20×20 microns). With no spreading this is equivalent to a 20 micron thick flow cell. Spreading can be promoted to further reduce thickness of the volume to about 5-10 microns. To cover one cm² using 10 pl drops with zero spreading, 100×50×50=250,000 drops can be used.
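The drop count stated above can be reproduced from the drop dimensions. The following Python sketch is illustrative only; the function names are hypothetical, and it assumes that each drop is approximated as a cube with a 20 micron edge that covers a 20×20 micron footprint with no spreading and no overlap.

```python
import math

UM_PER_CM = 10_000    # microns per centimeter
UM3_PER_PL = 1_000    # cubic microns per picoliter


def drop_volume_pl(edge_um):
    """Volume in picoliters of a drop approximated as a cube of the given edge."""
    return edge_um ** 3 / UM3_PER_PL


def drops_to_cover(area_cm2, edge_um):
    """Number of non-overlapping square footprints needed to tile the area."""
    area_um2 = area_cm2 * UM_PER_CM ** 2
    return math.ceil(area_um2 / edge_um ** 2)


if __name__ == "__main__":
    print(drop_volume_pl(20))     # volume of one 20 micron drop, in pl
    print(drops_to_cover(1, 20))  # drops needed for one square centimeter
```

Under these assumptions a 20×20×20 micron drop has a volume of about 8 picoliters, i.e. on the order of 10 picoliters, and tiling one cm² requires 500×500 = 250,000 such drops, which agrees with the figure of 250,000 drops given above.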

In addition to the piezo approach, other forms of delivery of a low amount of buffer per large surface can be used, such as by contacting the support with a porous material filled with reaction buffer or by moving a long slit across the surface with a few 10-30 micron openings allowing dispensation of the buffer.

One exemplary method of circularization involves ligation of a single adaptor to dsDNA using two blocked complementary strands. In this method, two complementary strands of an adaptor are independently prepared. A matching blocking oligo that has uracils and cannot be ligated to target DNA is also made for each of the two complementary strands. A dsDNA product comprising one adaptor strand and one blocking oligo is assembled. Two assembled dsDNA constructs are designed that cannot ligate or hybridize one to another; the constructs may be blunt end or may have a T overhang or other overhangs for ligation to DNA targets. A mixture of these two constructs is ligated to blunt end dsDNA or DNA with corresponding sticky ends. About 50% of DNA will have one of each construct; the other 50% will have two of the same construct. The blocking oligo is then degraded, and the circle is closed by hybridization of complementary strands and ligation.

In one embodiment, the adaptor may be palindromic to avoid distinction of orientation. Such an approach can provide a better yield than the A/T ligation approach, depending on blunt end ligation efficiency and concentration of DNA in the A/T ligation reaction. In a further embodiment, four instead of two ssDNA adaptor components are used.

Methods for Creating Concatemers

In one aspect of the invention, single molecules comprise concatemers of polynucleotides, usually polynucleotide analytes, i.e. target sequences, that have been produced in a conventional rolling circle replication (RCR) reaction. Guidance for selecting conditions and reagents for RCR reactions is available in many references available to those of ordinary skill, as evidenced by the following that are incorporated by reference: Kool, U.S. Pat. No. 5,426,180; Lizardi, U.S. Pat. Nos. 5,854,033 and 6,143,495; Landegren, U.S. Pat. No. 5,871,921; and the like. Generally, RCR reaction components comprise single stranded DNA circles, one or more primers that anneal to DNA circles, a DNA polymerase having strand displacement activity to extend the 3′ ends of primers annealed to DNA circles, nucleoside triphosphates, and a conventional polymerase reaction buffer. Such components are combined under conditions that permit primers to anneal to DNA circles and be extended by the DNA polymerase to form concatemers of DNA circle complements. An exemplary RCR reaction protocol is as follows: In a 50 μL reaction mixture, the following ingredients are assembled: 2-50 pmol circular DNA, 0.5 units/μL phage φ29 DNA polymerase, 0.2 μg/μL BSA, 3 mM dNTP, 1× φ29 DNA polymerase reaction buffer (Amersham). The RCR reaction is carried out at 30° C. for 12 hours. In some embodiments, the concentration of circular DNA in the polymerase reaction may be selected to be low (approximately 10-100 billion circles per ml, or 10-100 circles per picoliter) to avoid entanglement and other intermolecular interactions.
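The equivalence between the two concentration units given above, and the corresponding molar concentration, can be verified by unit conversion. The following Python sketch is illustrative only; the function names are hypothetical.

```python
AVOGADRO = 6.02214076e23  # molecules per mole
PL_PER_ML = 1e9           # picoliters per milliliter
ML_PER_L = 1e3            # milliliters per liter


def circles_per_pl(circles_per_ml):
    """Convert a number concentration from circles/ml to circles/pl."""
    return circles_per_ml / PL_PER_ML


def circles_per_ml_to_picomolar(circles_per_ml):
    """Convert circles/ml to a molar concentration expressed in pM."""
    moles_per_liter = circles_per_ml * ML_PER_L / AVOGADRO
    return moles_per_liter * 1e12


if __name__ == "__main__":
    for per_ml in (10e9, 100e9):  # 10 and 100 billion circles per ml
        print(per_ml, circles_per_pl(per_ml), circles_per_ml_to_picomolar(per_ml))
```

Since one milliliter contains 10⁹ picoliters, 10-100 billion circles per ml corresponds to 10-100 circles per picoliter, which in molar terms is roughly 17-170 pM.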

Preferably, concatemers produced by RCR are approximately uniform in size; accordingly, in some embodiments, methods of making arrays of the invention may include a step of size-selecting concatemers. For example, in one aspect, concatemers are selected that as a population have a coefficient of variation in molecular weight of less than about 30%; and in another embodiment, less than about 20%. In one aspect, size uniformity is further improved by adding low concentrations of chain terminators, such as ddNTPs, to the RCR reaction mixture to reduce the presence of very large concatemers, e.g. produced by DNA circles that are synthesized at a higher rate by polymerases. In one embodiment, concentrations of ddNTPs are used that result in an expected concatemer size in the range of from 50-250 Kb, or in the range of from 50-100 Kb. In another aspect, concatemers may be enriched for a particular size range using conventional separation techniques, e.g. size-exclusion chromatography, membrane filtration, or the like.
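The coefficient of variation referred to above is the standard deviation of concatemer size divided by the mean size. The following Python sketch is illustrative only; the function names and the example sizes are hypothetical, concatemer length in Kb is used as a proxy for molecular weight, and the population standard deviation is assumed (a sample standard deviation could be substituted).

```python
from statistics import mean, pstdev


def coefficient_of_variation(sizes_kb):
    """Population standard deviation divided by the mean of the sizes."""
    m = mean(sizes_kb)
    if m <= 0:
        raise ValueError("mean size must be positive")
    return pstdev(sizes_kb) / m


def is_size_uniform(sizes_kb, max_cv=0.30):
    """True if the population CV is below the chosen threshold (e.g. 30% or 20%)."""
    return coefficient_of_variation(sizes_kb) < max_cv


if __name__ == "__main__":
    example_sizes_kb = [60, 75, 80, 90, 100]  # made-up sizes for illustration
    print(coefficient_of_variation(example_sizes_kb))
    print(is_size_uniform(example_sizes_kb, max_cv=0.30))
```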

An exemplary method for producing concatemers is illustrated in FIG. 2A. After DNA circles (1608) are formed, further interspersed adaptors are inserted as illustrated generally in FIG. (2A) to form circles (1612) containing interspersed adaptors. To these circles, a primer and rolling circle replication (RCR) reagents can be added to generate (1614) in a conventional RCR reaction a population (1616) of concatemers (1617) of the complements of the adaptor oligonucleotide and DNA fragments. This population can then be isolated or otherwise processed (e.g. size selected) (1618) using conventional techniques, e.g. a conventional spin column, and the like, to form population (1620) for analysis.

Target polynucleotides may be generated from a source nucleic acid, such as genomic DNA, by fragmentation to produce fragments 0.2-2 kb in size, or more preferably, 0.3-0.6 kb in size, which then may be circularized for an RCR reaction.

In another aspect, the invention provides methods and compositions for generating concatemers of a plurality of target polynucleotides containing interspersed adaptors. In one embodiment, such concatemers may be generated by RCR, as illustrated in FIGS. 1C-1D.

Rolling circle replication is a preferred method of creating concatemers of the invention. The RCR process has been shown to generate multiple continuous copies of the M13 genome (Blanco, et al., (1989) J Biol Chem 264:8935-8940). In this system, the desired DNA fragment is “cloned” into a DNA adaptor and replicated by linear concatemerization. The target DNA is immediately in a form suitable for hybridization and enzymatic methodologies without the need to passage through bacteria.

The RCR process relies upon the desired target molecule first being formed into a circular substrate. This linear amplification uses the original DNA molecule, not copies of a copy, thus ensuring fidelity of sequence. As a circular entity, the molecule acts as an endless template for a strand displacing polymerase that extends a primer complementary to a portion of the circle. The continuous strand extension creates long, single-stranded DNA consisting of hundreds of concatemers comprising multiple copies of sequences complementary to the circle.

Methods for Creating Arrays

In one embodiment, emulsion PCR is used to generate amplicons for disposal onto an array. As illustrated in FIG. 1B, after breaking emulsion (1505), beads containing clones of the adaptored sequences may be arrayed (1520) on a solid surface (1522) for sequence analysis. Such an array of beads may be random, as illustrated in FIG. 1F, where the locations of the beads are not determined prior to arraying, or the array may be in accordance with a predetermined pattern of binding sites (1524), even though the distribution of beads on such sites is randomly determined. Both of such distributions are referred to herein as “random arrays.”

To achieve compact, dense bundles of the DNA in the form of sub-micron spots, a region of the amplified molecule for hybridization to a capture probe attached to the glass surface can be utilized. Hundreds of capture probe molecules (spaced about 10 nm apart) can keep hundreds of concatenated copies of a target molecule tightly bound to a glass surface area of less than 500 nm in diameter. In one embodiment, glass activation chemistry is applied that creates a monolayer of isothiocyanate reactive groups for attaching amine modified capture oligonucleotides.

Generally, densities of single molecules are selected that permit at least twenty percent, or at least thirty percent, or at least forty percent, or at least a majority of the molecules to be resolved individually by the signal generation and detection systems used. In one aspect, a density is selected that permits at least seventy percent of the single molecules to be individually resolved. In one aspect, whenever scanning electron microscopy is employed, for example, with molecule-specific probes having gold nanoparticle labels, e.g. Nie et al (2006), Anal. Chem., 78: 1528-1534, which is incorporated by reference, a density is selected such that at least a majority of single molecules have a nearest neighbor distance of 50 nm or greater; and in another aspect, such density is selected to ensure that at least seventy percent of single molecules have a nearest neighbor distance of 100 nm or greater. In another aspect, whenever optical microscopy is employed, for example with molecule-specific probes having fluorescent labels, a density is selected such that at least a majority of single molecules have a nearest neighbor distance of 200 nm or greater; and in another aspect, such density is selected to ensure that at least seventy percent of single molecules have a nearest neighbor distance of 200 nm or greater. In still another aspect, whenever optical microscopy is employed, for example with molecule-specific probes having fluorescent labels, a density is selected such that at least a majority of single molecules have a nearest neighbor distance of 300 nm or greater; and in another aspect, such density is selected to ensure that at least seventy percent of single molecules have a nearest neighbor distance of 300 nm or greater, or 400 nm or greater, or 500 nm or greater, or 600 nm or greater, or 700 nm or greater, or 800 nm or greater. In still another embodiment, whenever optical microscopy is used, a density is selected such that at least a majority of single molecules have a nearest neighbor distance of at least twice the minimal feature resolution power of the microscope. In another aspect, polymer molecules of the invention are disposed on a surface so that the density of separately detectable polymer molecules is at least 1000 per μm², or at least 10,000 per μm², or at least 100,000 per μm².
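The relationship between surface density and nearest neighbor distance can be illustrated with a minimal sketch. It assumes, as a simplification not stated in the text, that single molecules land uniformly at random (a two-dimensional Poisson process), for which the probability that a molecule has no neighbor within distance r is exp(-ρπr²). The function name and the example inputs are hypothetical.

```python
import math


def max_density_per_um2(min_distance_nm: float, fraction_resolved: float) -> float:
    """Largest density (molecules per square micron) at which at least
    `fraction_resolved` of molecules have no neighbor closer than
    `min_distance_nm`, ASSUMING uniform random (Poisson) placement."""
    if not 0.0 < fraction_resolved < 1.0:
        raise ValueError("fraction_resolved must be strictly between 0 and 1")
    r_um = min_distance_nm / 1000.0
    # P(nearest neighbor farther than r) = exp(-density * pi * r^2)
    return -math.log(fraction_resolved) / (math.pi * r_um ** 2)


# Example thresholds taken from the paragraph above (distance in nm, fraction).
for r_nm, frac in [(50, 0.5), (200, 0.5), (200, 0.7), (300, 0.7)]:
    density = max_density_per_um2(r_nm, frac)
    print(f"{r_nm} nm, {frac:.0%} resolved: {density:.2f} molecules per um^2")
```

Under this assumed model, a stricter fraction or a larger distance lowers the permissible density, which is one motivation for the patterned surfaces described next.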

In another aspect of the invention, the requirement of selecting densities of randomly disposed single molecules to ensure desired nearest neighbor distances is obviated by providing on a surface discrete spaced apart regions that are substantially the sole sites for attaching single molecules. That is, in such embodiments the regions on the surface between the discrete spaced apart regions, referred to herein as “inter-regional areas,” are inert in the sense that concatemers, or other macromolecular structures, do not bind to such regions. In some embodiments, such inter-regional areas may be treated with blocking agents, e.g. DNAs unrelated to concatemer DNA, other polymers, and the like. Generally, the area of discrete spaced apart regions is selected, along with attachment chemistries, macromolecular structures employed, and the like, to correspond to the size of single molecules of the invention so that when single molecules are applied to the surface substantially every region is occupied by no more than one single molecule. The likelihood of having only one single molecule per discrete spaced apart region may be increased by selecting a density of reactive functionalities or capture oligonucleotides that results in fewer such moieties than their respective complements on single molecules. Thus, a single molecule will “occupy” all linkages to the surface at a particular discrete spaced apart region, thereby reducing the chance that a second single molecule will also bind to the same region. In particular, in one embodiment, substantially all the capture oligonucleotides in a discrete spaced apart region hybridize to adaptor oligonucleotides of a single macromolecular structure. In one aspect, a discrete spaced apart region contains a number of reactive functionalities or capture oligonucleotides that is from about ten percent to about fifty percent of the number of complementary functionalities or adaptor oligonucleotides of a single molecule. The length and sequence(s) of capture oligonucleotides may vary widely, and may be selected in accordance with well known principles, e.g. Wetmur, Critical Reviews in Biochemistry and Molecular Biology, 26: 227-259 (1991); Britten and Davidson, chapter 1 in Hames et al, editors, Nucleic Acid Hybridization: A Practical Approach (IRL Press, Oxford, 1985). In one aspect, the lengths of capture oligonucleotides are in a range of from 6 to 30 nucleotides, and in another aspect, within a range of from 8 to 30 nucleotides, or from 10 to 24 nucleotides. Lengths and sequences of capture oligonucleotides are selected (i) to provide effective binding of macromolecular structures to a surface, so that losses of macromolecular structures are minimized during steps of analytical operations, such as washing, etc., and (ii) to avoid interference with analytical operations on analyte molecules, particularly when analyte molecules are DNA fragments in a concatemer. In regard to (i), in one aspect, sequences and lengths are selected to provide duplexes between capture oligonucleotides and their complements that are sufficiently stable so that they do not dissociate in a stringent wash. In regard to (ii), if DNA fragments are from a particular species of organism, then databases, when available, may be used to screen potential capture sequences that may form spurious or undesired hybrids with DNA fragments. Other factors in selecting sequences for capture oligonucleotides are similar to those considered in selecting primers, hybridization probes, oligonucleotide tags, and the like, for which there is ample guidance, as evidenced by the references cited below in the Definitions section.

In one aspect, the area of discrete spaced apart regions is less than 1 μm²; and in another aspect, the area of discrete spaced apart regions is in the range of from 0.04 μm² to 1 μm²; and in still another aspect, the area of discrete spaced apart regions is in the range of from 0.2 μm² to 1 μm². In another aspect, when discrete spaced apart regions are approximately circular or square in shape so that their sizes can be indicated by a single linear dimension, the size of such regions is in the range of from 125 nm to 250 nm, or in the range of from 200 nm to 500 nm. In one aspect, center-to-center distances of nearest neighbors of such regions are in the range of from 0.25 μm to 20 μm; and in another aspect, such distances are in the range of from 1 μm to 10 μm, or in the range from 50 to 1000 nm. Preferably, spaced apart regions for immobilizing concatemers are arranged in a rectilinear or hexagonal pattern.

In one embodiment, spacer DNBs are used to prepare a surface for attachment of test DNBs. The surface is first covered by the capture oligonucleotide complementary to the binding site present on two types of synthetic DNBs; one is a capture DNB, the other is a spacer DNB. The spacer DNBs do not have DNA segments complementary to the adaptor used in preparation of test DNBs. The surface with capture oligonucleotide is “saturated” with a mix of synthetic DNBs (prepared by chain ligation or by RCR) in which the spacer DNBs are used in about 10-fold (or 5- to 50-fold) excess to capture DNBs. Because of the ~10:1 ratio between spacer and capture DNBs, the capture DNBs are mostly individual islands in a sea of spacer DNBs. The 10:1 ratio provides that two capture DNBs are on average separated by two spacer DNBs. If DNBs are about 200 nm in diameter, then two capture DNBs are at about 600 nm center-to-center spacing. This surface is then used to attach test DNBs or other molecular structures that have a binding site complementary to a region of the capture DNBs but not present on the spacer DNBs.
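The spacing estimate above can be approximated with a short calculation. The sketch below assumes, purely for illustration, that the synthetic DNBs form a close-packed monolayer on a square grid with one DNB diameter per cell, so that each capture DNB accounts for (ratio + 1) cells; this packing model and the function name are assumptions, not part of the described method.

```python
import math


def capture_spacing_nm(dnb_diameter_nm: float, spacer_to_capture_ratio: float) -> float:
    """Approximate mean center-to-center spacing of capture DNBs.

    ASSUMED model: a saturated monolayer on a square grid in which each
    capture DNB "owns" (ratio + 1) DNB-sized cells, so the linear spacing
    scales with the square root of that cell count.
    """
    if dnb_diameter_nm <= 0 or spacer_to_capture_ratio < 0:
        raise ValueError("diameter must be positive and ratio non-negative")
    cells_per_capture = spacer_to_capture_ratio + 1.0
    return dnb_diameter_nm * math.sqrt(cells_per_capture)


# 200 nm DNBs at the 10:1 spacer-to-capture ratio described above.
print(f"{capture_spacing_nm(200, 10):.0f} nm")
# The 5- to 50-fold range mentioned in the text.
for ratio in (5, 50):
    print(ratio, f"{capture_spacing_nm(200, ratio):.0f} nm")
```

With these assumptions the 10:1 case comes out in the 600 to 700 nm range, in line with the spacing of about 600 nm given in the text.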

Capture DNBs may be prepared to have fewer copies than the number of binding sites in test DNBs to assure single test DNB attachment per capture DNB spot. Because the test DNA can bind only to capture DNBs, an array of test DNBs may be prepared that has high site occupancy without congregation. Due to random attachment, some areas on the surface may not have any DNBs attached, but these areas with free capture oligonucleotide may not be able to bind test DNBs since they are designed not to have binding sites for the capture oligonucleotide. Arrays of the invention may or may not be arranged in a grid pattern.

In one aspect, a high density array of capture oligonucleotide spots of sub micron size is prepared using a printing head or imprint-master prepared from a bundle, or bundle of bundles, of about 10,000 to 100 million optical fibers with a core and cladding material. By proper pulling and fusing of fibers, a unique material may be produced that has about 50-1000 nm cores separated by a similar or 2-5 fold smaller or larger size cladding material. In one embodiment, differential etching (dissolving) of cladding material provides a nano-printing head having a very large number of nano-sized posts. This printing head may be used for depositing oligonucleotides or other biological (proteins, oligopeptides, DNA, aptamers) or chemical compounds such as silane with various active groups.

In one embodiment, the glass fiber tool may be used as a patterned support to deposit oligonucleotides or other biological or chemical compounds. In this case only posts created by etching may be contacted with material to be deposited. In another embodiment, a flat cut of the fused fiber bundle may be used to guide light through cores and allow light-induced chemistry to occur only at the tip surface of the cores, thus eliminating the need for etching. In both embodiments, the same support may then be used as a light guiding/collection device for imaging fluorescence labels used to tag oligonucleotides or other reactants. This device provides a large field of view with a large numerical aperture (potentially >1).

Stamping or printing tools that perform active material or oligonucleotide deposition may be used to print 2 to 100 different oligonucleotides in an interleaved pattern. This type of oligonucleotide array may be used for attaching 2 to 100 different DNA populations, such as populations derived from different source DNA. Such arrays also may be used for parallel reading from sub-light resolution spots by using DNA specific anchors or tags. Information can be accessed by DNA specific tags, e.g. 16 specific anchors for 16 DNAs, reading 2 bases by a combination of 5-6 colors and using 16 ligation cycles, or one ligation cycle and 16 decoding cycles.

In embodiments of the invention, photolithography, electron beam lithography, nano imprint lithography, and nano printing may be used to generate such patterns on a wide variety of surfaces, e.g. Pirrung et al, U.S. Pat. No. 5,143,854; Fodor et al, U.S. Pat. No. 5,774,305; Guo, (2004) Journal of Physics D: Applied Physics, 37: R123-141; which are incorporated herein by reference. These techniques can be used to generate patterns of features on the order of one-tenth of a micron and have been developed for use in the semiconductor industry. In a preferred embodiment, a single “masking” operation is performed on the DNA array substrate, as opposed to the 20 to 30 masking operations typically needed to create even a simple semiconductor. Using a single masking operation eliminates the need for the accurate alignment of many masks to the same substrate. There is also no need for doping of materials. Minor defects in the pattern may have little to no effect on the usability of the array, thus allowing production yields to approach 100%.

In one embodiment, high density structured random DNA array chips have capture oligonucleotides concentrated in small, segregated capture cells aligned into a rectangular grid formation (FIG. 4). Preferably, each capture cell or binding site is surrounded by an inert surface and may have a sufficient but limited number of capture molecules (100-400). Each capture molecule may bind one copy of the matching adaptor sequence on the RCR produced DNA concatemer. Since each concatemer contains over 1000 copies of the adaptor sequence, it is able to quickly saturate the binding site upon contact and prevent other concatemers from binding, resulting in exclusive attachment of one RCR product per binding site or spot. By providing enough RCR products, almost every spot on the array may contain one and only one unique DNA target.

RCR “molecular cloning” allows the application of the saturation/exclusion (single occupancy) principle in making random arrays. The exclusion process is not feasible in making single molecule arrays if an in situ amplification is alternatively applied. RCR concatemers provide an optimal size to form small non-mixed DNA spots. Each concatemer of about 100 kb is expected to occupy a space of about 0.1×0.1×0.1 μm, thus allowing RCR products to fit into 100 nm capture cells. One advantage of RCR products is that the single stranded DNA is ready for hybridization and is very flexible for forming a randomly coiled ball of DNA. The 1000 copies of DNA target produced by RCR provide much higher specificity than is possible with analysis of a single molecule.

There are methods known in the art for generating a patterned DNA chip. In a preferred embodiment, all spots on the chip have the same capture oligonucleotides and a 0.2-0.3 micron spot size at 0.5 micron pitch. Nano-printing approaches may be used for producing such patterns, as they do not require development of new oligonucleotide attachment chemistry.

Nano-imprint technologies rely on classic photolithographic techniques to produce a master mold. The master mold is then replicated using polymers such as PMMA or PDMS. These polymers, upon curing, form a negative mold of the master. The mold is then used to “print” patterns of material on a substrate. The nano-imprint technique can be used to create protein features on glass, silicon, and gold surfaces. In an exemplary embodiment, a master mold is used to generate many stamping devices and each stamping device can generate many prints of chemicals (such as oligonucleotide solution, oligonucleotide binding or glass activation chemicals). Advanced nano-printing techniques can produce features as small as 10 nm; thus, features appropriate for fluorescent detection that are >200 nm in size, including features 300-500 nm at 1000 microns center to center, can be produced routinely.

Various chemical modifications can be used to alter surface properties, increasing the compatibility of the master mold with a wide range of materials, thus allowing the use of a small feature, low-density mold to create high density arrays. In one embodiment, a mold with a 4 μm feature pitch can be used to create a 1 μm feature pitch on the substrate by printing the same substrate 16 times in a 4 by 4 grid.
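The interleaved printing scheme can be sketched as a set of stage offsets. The example below is an illustration only: it assumes an idealized square-grid mold, integer micron units, and perfect registration between prints, and the function names are hypothetical.

```python
def print_offsets(mold_pitch_um: int, target_pitch_um: int) -> list[tuple[int, int]]:
    """Offsets (dx, dy) in microns for each print pass, assuming the mold
    pitch is an integer multiple of the target pitch."""
    if target_pitch_um <= 0 or mold_pitch_um % target_pitch_um != 0:
        raise ValueError("mold pitch must be a positive multiple of the target pitch")
    steps = mold_pitch_um // target_pitch_um
    return [(ix * target_pitch_um, iy * target_pitch_um)
            for iy in range(steps) for ix in range(steps)]


def printed_features(mold_pitch_um: int, target_pitch_um: int,
                     features_per_side: int) -> set[tuple[int, int]]:
    """Feature positions left on the substrate after all passes."""
    spots = set()
    for dx, dy in print_offsets(mold_pitch_um, target_pitch_um):
        for i in range(features_per_side):
            for j in range(features_per_side):
                spots.add((dx + i * mold_pitch_um, dy + j * mold_pitch_um))
    return spots


offsets = print_offsets(4, 1)                         # 4 um mold, 1 um target pitch
spots = printed_features(4, 1, features_per_side=3)   # small 3 x 3 mold for illustration
print(len(offsets), "passes;", len(spots), "features")
```

Each pass shifts the mold by one target pitch in x or y, so the 16 passes fill every site of the finer grid without overlap in this idealized model.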

In one aspect, a method of creating DNA arrays involves the use of a thin layer of photo-resist to protect portions of the substrate surface during a functionalization process. The patterned photo-resist is removed after functionalization, leaving an array of activated areas. A second approach involves attaching a monolayer of modified oligonucleotides to the substrate. The oligonucleotides are modified with a photo-cleavable protecting group. These protecting groups can be removed by exposure to an illumination source, allowing patterned ligation of a capture oligonucleotide for attachment of DNBs by hybridization.

In another embodiment, a commercially available, optically flat, quartz wafer is spin coated with a 100-500 nm thick layer of photo-resist. The photo-resist is baked on to the quartz wafer, and an image of a reticle with a pattern of spots to be activated is projected onto the surface of the photo-resist, using a machine commonly called a stepper. After exposure, the photo-resist is developed, removing the areas of the projected pattern which were exposed to the UV source. This is accomplished by plasma etching, a dry developing technique capable of producing very fine detail. The wafer is then baked to strengthen the remaining photo-resist.

After baking, the quartz wafer is ready for functionalization. The wafer is then subjected to vapor-deposition of 3-aminopropyldimethylethoxysilane, the same monomer used in the current functionalization process. The density of the amino functionalized monomer can be tightly controlled by varying the concentration of the monomer and the time of exposure of the substrate. Only areas of quartz exposed by the plasma etching process may react with and capture the monomer. The wafer is then baked again to cure the monolayer of amino-functionalized monomer to the exposed quartz. After baking, the remaining photo-resist may be removed using acetone. Because of the difference in attachment chemistry between the resist and silane, aminosilane-functionalized areas on the substrate may remain intact through the acetone rinse. These areas can be further functionalized by reacting them with p-phenylenediisothiocyanate in a solution of pyridine and N,N-dimethylformamide. The substrate may then be compatible with amine-modified oligonucleotides. Alternatively, oligonucleotides can be prepared with a 5′-carboxy-modifier-c10 (Glen Research: http://www.glenres.com/ProductFiles/10-1935.html). This technique allows the oligonucleotide to be attached directly to the amine modified support, thereby avoiding additional functionalization steps.

In another embodiment, a nano-imprint lithography (NIL) process is used which starts with the production of a master imprint tool. This tool is produced using high-resolution e-beam lithography, and can be used to create a large number of imprints, depending on the NIL polymer utilized. For DNA array production, the quartz substrate would be spin coated with a layer of resist, commonly called the transfer layer. A second type of resist, commonly called the imprint layer, is then applied over the transfer layer. The master imprint tool then makes an impression on the imprint layer. The overall thickness of the imprint layer is then reduced by plasma etching until the low areas of the imprint reach the transfer layer. Because the transfer layer is harder to remove than the imprint layer, it remains largely untouched. The imprint and transfer layers are then hardened by heating. The substrate is then put back into the plasma etcher until the low areas of the imprint reach the quartz. The substrate is then derivatized by vapor deposition as described in method 1a.

In another embodiment, a nano-printing method is used. Such a process uses photo, imprint, or e-beam lithography to create a master mold. There are many variations on the techniques used to manufacture the nano-imprint tools. In one exemplary method, the master mold is created as a negative image of the features required on the print head. The print heads are usually made of a soft, flexible polymer such as polydimethylsiloxane (PDMS). This material, or layers of materials having different properties, is spin coated onto a quartz substrate. The mold is then used to emboss the features onto the top layer of resist material under controlled temperature and pressure conditions. The print head is then subjected to a plasma based etching process to improve the aspect ratio of the print head, and eliminate distortion of the print head due to relaxation over time of the embossed material. The print head is used to deposit a pattern of amine modified oligonucleotides onto a homogeneously derivatized surface. These oligonucleotides serve as capture probes for the DNBs. One advantage of nano-printing is the ability to print interleaved patterns of different capture probes onto the random array support. This can be accomplished by successive printing with multiple print heads, each head having a differing pattern, and all patterns fitting together to form the final structured support pattern. Such methods allow for positional encoding of DNA elements within the random array. For example, control DNBs containing a specific anchor sequence can be bound at regular intervals throughout a random array.

Electron beam lithography can also be used to create the substrate. This process is very similar to photolithography, except the pattern is drawn directly on a special resist material using an electron beam gun. The benefit of this process is that the feature size can be much smaller and more precise than with UV photolithographic methods. A potential drawback is that the amount of time required to create the pattern is on the order of hours per substrate, as opposed to a couple of seconds using photolithographic methods or less than a minute for NIL.

In one embodiment, the arrays are produced using photo-cleavable modifiers, also referred to as protecting groups. In such a method, capture cells can be created by using commercially available photo-cleavable modifiers to oligonucleotides, such as the PC Linker Phosphoramidite, available from Glen Research. An oligonucleotide with a 5′ photo-cleavable protection group, in this case DMTO, is attached to a fully functionalized piece of quartz at the 3′ terminus. The exposed areas lose their protecting group, leaving a 5′ phosphate. Using oligonucleotide ligation, a capture oligonucleotide complementary to the adaptor region of RCR products is ligated to exposed phosphate groups if a template oligonucleotide is provided as depicted below:

(oligonucleotide on the surface) |------cttactgtgc (SEQ ID NO. 10) -POH-
(capture oligonucleotide) ggactaccgtttagg..cccgtgg (SEQ ID NO. 11)
(single template oligonucleotide) gaatgacacg (SEQ ID NO. 12) ......cctgatggca (SEQ ID NO. 13)
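As a quick consistency check on the depicted sequences, the sketch below tests whether each half of the template oligonucleotide is the base-by-base complement of the strand segment it is drawn against. It assumes the strands are written in the diagram so that position i of the template pairs with position i of the strand above it; the helper name is hypothetical.

```python
# Watson-Crick pairing table for lowercase DNA letters.
PAIR = str.maketrans("acgt", "tgca")


def pairs_with(strand: str, template: str) -> bool:
    """True if `template` is the position-by-position complement of
    `strand`, as the two are drawn against each other in the diagram."""
    return len(strand) == len(template) and strand.translate(PAIR) == template


surface_end = "cttactgtgc"          # SEQ ID NO. 10, end of the surface oligonucleotide
capture_start = "ggactaccgtttagg"   # first segment of SEQ ID NO. 11
template_left = "gaatgacacg"        # SEQ ID NO. 12
template_right = "cctgatggca"       # SEQ ID NO. 13

print(pairs_with(surface_end, template_left))
print(pairs_with(capture_start[:len(template_right)], template_right))
```

If both checks hold, the template bridges the junction between the surface oligonucleotide and the capture oligonucleotide, which is the arrangement needed for templated ligation.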

After ligation of the capture oligonucleotide to the deprotected surface oligonucleotides, the entire substrate can be exposed to a UV source to remove the remaining protecting groups. The free phosphate groups may be blocked by ligating hairpin-like oligonucleotides to prevent ligation of labeled probes used in the sequencing process to the support oligonucleotide.

The photo-resist material used in fabrication methods is generally quite hydrophobic, and the patterns made in that material consist of very small holes. It is possible that the exposed surface of the quartz may not come into contact with aqueous solutions of the amino functionalized monomer due to the hydrophobic effect of the photo-resist. To avoid this problem, one embodiment of the invention uses ultrasound to force the liquid past the small openings in the mask. It is also possible to add a small amount of surfactant, acetone, or other additive to the solution to break the surface tension of the water. The use of solvents in this manner might swell the mask material slightly, but it would not dissolve it. In the event that the resist material is incompatible with the amino-functionalized surface during the resist removal process, for instance because it might react with and destroy the amine, it is possible to perform a mechanical peel of the resist material using a strong acrylic based adhesive on a polymer sheet.

After each batch of DNA array substrates is made, it may be important to determine if the batch is up to specification. Specifications may be determined during the mask design and biochemistry optimization phase. Quality control of each batch of substrates can be performed by attaching FITC or an amine-modified oligonucleotide with any fluorescent label to the reactive surface and observing the intensity and pattern of the fluorescence on the substrate surface. The overall intensity of the active regions may be proportional to the density of reactive sites in the capture cells. The current microscopy system has a 100×, 1.4 NA lens that has a theoretical resolving power of about 180 nm. The sensitivity of the current image acquisition system is about 3 dye molecules per pixel, with each pixel imaging a 60×60 nm area of the substrate. It is expected that between 10 and 50 capture oligonucleotides can be attached per 60 nm square area. This allows directly measuring, with high accuracy, the attachment efficiency and grid properties of the substrate. Each capture cell may be imaged by roughly 10 pixels.
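The optical figures quoted above can be related by a short calculation. The sketch below uses the Abbe diffraction limit, λ/(2·NA), and makes two assumptions that the text does not state: an emission wavelength of 500 nm and a circular capture cell 200 nm in diameter. The function names are hypothetical.

```python
import math


def abbe_limit_nm(wavelength_nm: float, numerical_aperture: float) -> float:
    """Abbe lateral resolution limit: wavelength / (2 * NA)."""
    return wavelength_nm / (2.0 * numerical_aperture)


def pixels_per_cell(cell_diameter_nm: float, pixel_size_nm: float) -> float:
    """Number of square pixels covering a circular capture cell, by area."""
    cell_area = math.pi * (cell_diameter_nm / 2.0) ** 2
    return cell_area / pixel_size_nm ** 2


# ASSUMED inputs: 500 nm light and a 200 nm capture cell; the 1.4 NA lens
# and 60 nm pixels are the values given in the text.
resolution = abbe_limit_nm(500, 1.4)
pixels = pixels_per_cell(200, 60)
print(f"resolution ~{resolution:.0f} nm, ~{pixels:.1f} pixels per cell")

# Capture oligonucleotides per cell at 10 to 50 per 60 nm pixel.
print(f"oligos per cell: {10 * pixels:.0f} to {50 * pixels:.0f}")
```

Under these assumptions the resolution is close to the quoted value of about 180 nm and a cell spans on the order of 10 pixels; other wavelengths or cell sizes would shift both numbers.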

Using the QC data, it is possible to determine which substrate preparation steps need improvement. Intensity variation between capture cells, at this point in the process, would point to uneven reaction conditions during the functionalization process or non-uniform development of the photo-resist layer. If there is bridging between cells, it would suggest that the photo-resist material delaminated from the surface of the quartz, or that something went wrong during the exposure process. Problems with signal intensity would point to poor control of the functionalization step. Additional metrics may need to be developed as the process matures.

Replica Arrays

In one aspect of the invention, complementary polynucleotides synthesized on a master array are transferred to a replica array. To achieve such a transfer, two surfaces may be contacted in the presence of heating to denature dsDNA and free newly made DNA strands. In another embodiment, the transfer is achieved by applying an electric field to discriminatively transfer only the replicated DNA that has about 5-50 times more charge than primers. In a further embodiment, after hybridizing the transferred strand, a reverse field is combined with a reduction in temperature to move primers back to the master array. In an embodiment in which the transfer is achieved by applying an electric field, porous glass is preferably used to allow the application of the electric field.

In one embodiment, a capture oligonucleotide is designed to correspond to the end of an amplicon opposite to the priming site to assure exclusive retention of the full length copies. Having a pattern of nine or more different capture oligonucleotides minimizes the chance of “cross talk” during DNA transfer from the master array. In one embodiment, the transfer is achieved without further amplification of DNA on the replica array; multiple transfers to the same replica may also be used to generate a stronger signal. In another embodiment, multiple replicas may be generated by partial transfer from the master array, with DNA amplification performed in each replica array.

In an exemplary embodiment, the substrate for the replica array contains primers for initiating DNA synthesis using template DNA attached on the first array. After contacting surfaces of the master array and support of the “to be formed” replica array in the presence of DNA polymerase, dNTPs and suitable buffer at optimum temperature, primer molecules hybridize to the template DNA on the master array and become extended by the polymerase. A stopping agent such as dsDNA may be used to stop DNA synthesis at the end of one copy. By increasing temperature, or by using other DNA denaturing agents, DNA strands may separate and the replica array can be separated from the first array. To prevent removal of original DNA from the master array, the original DNA may be directly (or indirectly via capture oligonucleotide) covalently attached to the master array support.

Any incomplete DNA that is attached to the replica array may be specifically removed after completion of the replication reaction using various methods known in the art, such as through protective ligation of the completed molecules that have specific ends; the incomplete molecules can then be removed without losing the completed molecules.

In one embodiment, primers cover the entire substrate surface for array preparation. A primer density of 10,000 per square micron provides, within a one micron gap between the two supports, a local concentration similar to or about 10 times higher than that used in PCR. Primers may have very long attachment linkers to be able to reach the DNA template on the first array's support. In this process there is no possibility for DNA diffusion and replica DNA spots may be only slightly larger than original spots. A very flat surface may be used to assure close proximity of the two surfaces. In one embodiment, DNBs provide DNA loops of about 300-500 nm which, when combined with 100 nm primer linkers, allow tolerance of surface imperfections.
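The local primer concentration implied by a surface density can be estimated by treating the primers as dissolved in the thin liquid layer between the two supports. The sketch below is a rough conversion under that assumption; the function name is hypothetical, and the PCR primer concentration used for comparison is left to the reader because the text does not state one.

```python
AVOGADRO = 6.02214076e23      # molecules per mole
LITERS_PER_UM3 = 1e-15        # one cubic micron expressed in liters


def local_concentration_uM(primers_per_um2: float, gap_um: float) -> float:
    """Effective primer concentration (micromolar) when surface-bound
    primers are averaged over a liquid layer of thickness `gap_um`."""
    if primers_per_um2 < 0 or gap_um <= 0:
        raise ValueError("density must be non-negative and gap positive")
    molecules_per_um3 = primers_per_um2 / gap_um
    molar = molecules_per_um3 / LITERS_PER_UM3 / AVOGADRO
    return molar * 1e6


# 10,000 primers per square micron across a one micron gap, as in the text.
print(f"{local_concentration_uM(10_000, 1.0):.1f} uM")
```

Halving the gap doubles the effective concentration, which is one reason the text emphasizes very flat surfaces and close proximity.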

Replica arrays may be used to produce additional replicas. Second generation replicas would have the same DNA strand as the original array.

Replica arrays may be used for parallel analysis of the same set of DNA fragments, such as hybridization with a large number of probes or probe pools. In another embodiment, self-assembled DNA master chips containing genomic fragments may be replicated to generate many detection arrays that do not need to be decoded because they match the same master chip that was already decoded. Thus, replication of arrays allows preparation of self-assembled DNA arrays with minimal decoding costs, because one master and its replicas may be used to produce thousands of final arrays.

Structure of Capture Oligos

In one embodiment, surface (1622 in FIGS. 1C and 1D) may have attached capture oligonucleotides that form complexes, e.g. double stranded duplexes, with a segment of an adaptor oligonucleotide in the concatemers, such as an anchor binding site or other elements. In other embodiments, capture oligonucleotides may comprise oligonucleotide clamps, or like structures, that form triplexes with adaptor oligonucleotides, e.g. Gryaznov et al, U.S. Pat. No. 5,473,060. In another embodiment, surface (1622) may have reactive functionalities that react with complementary functionalities on the concatemers to form a covalent linkage, e.g. by way of the same techniques used to attach cDNAs to microarrays, e.g. Smirnov et al (2004), Genes, Chromosomes & Cancer, 40: 72-77; Beaucage (2001), Current Medicinal Chemistry, 8: 1213-1244, which are incorporated herein by reference.

In one aspect, when enzymatic processing is not required, capture oligonucleotides may comprise non-natural nucleosidic units and/or linkages that confer favorable properties, such as increased duplex stability; such compounds include, but are not limited to, peptide nucleic acids (PNAs), locked nucleic acids (LNA), oligonucleotide N3′→P5′ phosphoramidates, oligo-2′-O-alkylribonucleotides, and the like.

Structure of Random Arrays

In one aspect, concatemers (1620 in FIGS. 1C and 1D) may be fixed to surface (1622) by any of a variety of techniques, including covalent attachment and non-covalent attachment. In one embodiment, surface (1622) may have attached capture oligonucleotides that form complexes, e.g. double stranded duplexes, with a segment of an adaptor oligonucleotide in the concatemers, such as an anchor binding site or other elements. In other embodiments, capture oligonucleotides may comprise oligonucleotide clamps, or like structures, that form triplexes with adaptor oligonucleotides, e.g. Gryaznov et al, U.S. Pat. No. 5,473,060. In another embodiment, surface (1622) may have reactive functionalities that react with complementary functionalities on the concatemers to form a covalent linkage, e.g. by way of the same techniques used to attach cDNAs to microarrays, e.g. Smirnov et al (2004), Genes, Chromosomes & Cancer, 40: 72-77; Beaucage (2001), Current Medicinal Chemistry, 8: 1213-1244, which are incorporated herein by reference. Long DNA molecules, e.g. several hundred nucleotides or larger, may also be efficiently attached to hydrophobic surfaces, such as a clean glass surface that has a low concentration of various reactive functionalities, such as -OH groups.

In one embodiment, complete genome sequencing uses an array comprising a 50 to 200× genome coverage of the analyzed polynucleotide fragments. For example, 6 billion DNBs with an average fragment length of 100 bases would contain 600 billion bases representing 100× genome coverage. In one embodiment, the array comprises 6 billion DNBs composed of 300-600 base long DNA fragments. The DNBs may be bound to the array substrate in a square pack arrangement at a pitch of one micron and the array substrate may be split across 16 segments. In a further embodiment, each segment contains 24 unit sub arrays with each unit sub array containing 16 million bound DNBs over a 2×2 square millimeter area.
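The coverage and layout figures above follow from simple arithmetic. The sketch below is illustrative only: it assumes a genome size of 6 billion bases (inferred from the 600 billion bases and 100× figures, not stated in the text), and the function names are hypothetical.

```python
# Illustrative arithmetic only. GENOME_BASES is an assumption inferred from the
# "600 billion bases representing 100x coverage" example.
GENOME_BASES = 6_000_000_000

def genome_coverage(n_dnbs, fragment_len, genome_bases=GENOME_BASES):
    """Fold coverage = total bases carried by the array / genome size."""
    return n_dnbs * fragment_len / genome_bases

def total_dnbs(segments, subarrays_per_segment, dnbs_per_subarray):
    """Total DNB count for a segmented array layout."""
    return segments * subarrays_per_segment * dnbs_per_subarray

print(genome_coverage(6_000_000_000, 100))  # 100.0
print(total_dnbs(16, 24, 16_000_000))       # 6144000000
```

Under these assumptions, the 16-segment layout with 24 unit sub arrays of 16 million DNBs each holds slightly more than the 6 billion DNBs of the example.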

A sequencing assay which uses 8 segments and DNBs 250 bases long may require 350 probe pools for sequencing. Various tradeoffs between fragment length, DNB count, pool sets, and overlap can be made to optimize sequence quality versus imaging time. For example, the same random array segmented into 16 segments may require 225 probe pools for sequencing. This would require fewer probe pool cycles, reducing imaging time. Additionally, DNBs can be composed of 500 base long fragments, requiring 3 billion DNBs to be assayed against 350 probe pools using 16 segments tested in 16 reaction chambers. This format would produce a random array with 256× genome coverage, thus reducing the unit array size to two square millimeters. In one embodiment, each probe pool is combinatorially labeled using 2 of 6 fluorophores producing up to 21 possible fluorescent label combinations. This labeling schema allows assaying against many probes simultaneously, reducing hybridization time by an order of magnitude.
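One reading of the label count that reproduces the figure of 21 is to count the 15 pairs of two distinct fluorophores together with the 6 single-fluorophore labels. The text does not spell out this counting, so the sketch below is an assumption, and the fluorophore names are hypothetical.

```python
from itertools import combinations_with_replacement

FLUOROPHORES = ["F1", "F2", "F3", "F4", "F5", "F6"]  # hypothetical names

def label_combinations(dyes):
    """Unordered labels made of one dye or of two distinct dyes."""
    return [tuple(sorted(set(pair)))
            for pair in combinations_with_replacement(dyes, 2)]

labels = label_combinations(FLUOROPHORES)
print(len(labels))                            # 21
print(sum(1 for l in labels if len(l) == 2))  # 15
```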

A wide variety of supports may be used for arrays of the invention. In one aspect, supports are rigid solids that have a surface, preferably a substantially planar surface so that single molecules to be interrogated are in the same plane. The latter feature permits efficient signal collection by detection optics.

In another aspect, solid supports of the invention are nonporous, particularly when random arrays of single molecules are analyzed by hybridization reactions requiring small volumes. Suitable solid support materials include materials such as glass, polyacrylamide-coated glass, ceramics, silica, silicon, quartz, various plastics, and the like.

In one aspect, the area of a planar surface may be in the range of from 0.5 to 4 cm². In one aspect, the solid support is glass or quartz, such as a microscope slide, having a surface that is uniformly silanized. This may be accomplished using conventional protocols, e.g. acid treatment followed by immersion in a solution of 3-glycidoxypropyltrimethoxysilane, N,N-diisopropylethylamine, and anhydrous xylene (8:1:24 v/v) at 80° C., which forms an epoxysilanized surface, e.g. Beattie et al (1995), Molecular Biotechnology, 4: 213. Such a surface is readily treated to permit end-attachment of capture oligonucleotides, e.g. by providing capture oligonucleotides with a 3′ or 5′ triethylene glycol phosphoryl spacer prior to application to the surface. Many other protocols may be used for adding reactive functionalities to glass and other surfaces, as evidenced by the disclosure in Beaucage (cited above).

Arrays of DNA targets with interspersed adaptor(s) are not limited to single molecules or concatemers, and can include arrays of in situ amplified DNA spots or arrays of particles, each comprising multiple copies of a target nucleic acid (for example beads used in emulsion-PCR). Furthermore, methods as described herein which utilize multiple anchors or primers that can be differentially removed or otherwise discriminated are not limited to interspersed adaptors, i.e. they can be accomplished on samples with two “standard”, i.e. end-ligated, adaptors having a total of 4 anchor sites.

Structure of Probes

The term “probes” is used broadly to mean oligonucleotides used in direct hybridization, in ligation of two probes, or in ligation of a probe to an anchor or anchor probe. Probes may have only a few specific bases (B) and many degenerate bases (N): for example BNNNNNNN or BBNNNNNN or NNBBNNNN. Anchor probes may be designed as U5-10B1-4 (that is, 5 to 10 anchor bases U followed by 1 to 4 specific bases B) to read 1-4 bases adjacent to an adaptor sequence complementary to an anchor U5-10 sequence.

The oligonucleotide probes of the invention can be labeled in a variety of ways, including the direct or indirect attachment of radioactive moieties, fluorescent moieties, colorimetric moieties, chemiluminescent moieties, and the like. Many comprehensive reviews of methodologies for labeling DNA and constructing DNA adaptors provide guidance applicable to constructing oligonucleotide probes of the present invention. Such reviews include Kricka, Ann. Clin. Biochem., 39: 114-129 (2002); Schaferling et al, Anal. Bioanal. Chem., (Apr. 12, 2006); Matthews et al, Anal. Biochem., Vol 169, pgs. 1-25 (1988); Haugland, Handbook of Fluorescent Probes and Research Chemicals, Tenth Edition (Invitrogen/Molecular Probes, Inc., Eugene, 2006); Keller and Manak, DNA Probes, 2nd Edition (Stockton Press, New York, 1993); and Eckstein, editor, Oligonucleotides and Analogues: A Practical Approach (IRL Press, Oxford, 1991); Wetmur, Critical Reviews in Biochemistry and Molecular Biology, 26: 227-259 (1991); Hermanson, Bioconjugate Techniques (Academic Press, New York, 1996); and the like. Many more particular methodologies applicable to the invention are disclosed in the following sample of references: Fung et al, U.S. Pat. No. 4,757,141; Hobbs, Jr., et al, U.S. Pat. No. 5,151,507; Cruickshank, U.S. Pat. No. 5,091,519 (synthesis of functionalized oligonucleotides for attachment of reporter groups); Jablonski et al, Nucleic Acids Research, 14: 6115-6128 (1986) (enzyme-oligonucleotide conjugates); Ju et al, Nature Medicine, 2: 246-249 (1996); Bawendi et al, U.S. Pat. No. 6,326,144 (derivatized fluorescent nanocrystals); Bruchez et al, U.S. Pat. No. 6,274,323 (derivatized fluorescent nanocrystals); and the like.

In one aspect, one or more fluorescent dyes are used as labels for the oligonucleotide probes, e.g. as disclosed by Menchen et al, U.S. Pat. No. 5,188,934 (4,7-dichlorofluorescein dyes); Bergot et al, U.S. Pat. No. 5,366,860 (spectrally resolvable rhodamine dyes); Lee et al, U.S. Pat. No. 5,847,162 (4,7-dichlororhodamine dyes); Khanna et al, U.S. Pat. No. 4,318,846 (ether-substituted fluorescein dyes); Lee et al, U.S. Pat. No. 5,800,996 (energy transfer dyes); Lee et al, U.S. Pat. No. 5,066,580 (xanthene dyes); Mathies et al, U.S. Pat. No. 5,688,648 (energy transfer dyes); and the like. Labeling can also be carried out with quantum dots, as disclosed in the following patents and patent publications, incorporated herein by reference: U.S. Pat. Nos. 6,322,901; 6,576,291; 6,423,551; 6,251,303; 6,319,426; 6,426,513; 6,444,143; 5,990,479; 6,207,392; 2002/0045045; 2003/0017264; and the like. As used herein, the term “fluorescent signal generating moiety” means a signaling means which conveys information through the fluorescent absorption and/or emission properties of one or more molecules. Such fluorescent properties include fluorescence intensity, fluorescence life time, emission spectrum characteristics, energy transfer, and the like.

Commercially available fluorescent nucleotide analogues readily incorporated into the labeling oligonucleotides include, for example, Cy3-dCTP, Cy3-dUTP, Cy5-dCTP, Cy5-dUTP (Amersham Biosciences, Piscataway, N.J., USA), fluorescein-12-dUTP, tetramethylrhodamine-6-dUTP, Texas Red®-5-dUTP, Cascade Blue®-7-dUTP, BODIPY® FL-14-dUTP, BODIPY® TMR-14-dUTP, BODIPY® TR-14-dUTP, Rhodamine Green™-5-dUTP, Oregon Green® 488-5-dUTP, Texas Red®-12-dUTP, BODIPY® 630/650-14-dUTP, BODIPY® 650/665-14-dUTP, Alexa Fluor® 488-5-dUTP, Alexa Fluor® 532-5-dUTP, Alexa Fluor® 568-5-dUTP, Alexa Fluor® 594-5-dUTP, Alexa Fluor® 546-14-dUTP, fluorescein-12-UTP, tetramethylrhodamine-6-UTP, Texas Red®-5-UTP, Cascade Blue®-7-UTP, BODIPY® FL-14-UTP, BODIPY® TMR-14-UTP, BODIPY® TR-14-UTP, Rhodamine Green™-5-UTP, Alexa Fluor® 488-5-UTP, Alexa Fluor® 546-14-UTP (Molecular Probes, Inc., Eugene, Oreg., USA). Other fluorophores available for post-synthetic attachment include, inter alia, Alexa Fluor® 350, Alexa Fluor® 532, Alexa Fluor® 546, Alexa Fluor® 568, Alexa Fluor® 594, Alexa Fluor® 647, BODIPY® 493/503, BODIPY FL, BODIPY R6G, BODIPY 530/550, BODIPY TMR, BODIPY 558/568, BODIPY 564/570, BODIPY 576/589, BODIPY 581/591, BODIPY 630/650, BODIPY 650/665, Cascade Blue, Cascade Yellow, Dansyl, lissamine rhodamine B, Marina Blue, Oregon Green 488, Oregon Green 514, Pacific Blue, rhodamine 6G, rhodamine green, rhodamine red, tetramethylrhodamine, Texas Red (available from Molecular Probes, Inc., Eugene, Oreg., USA), and Cy2, Cy3.5, Cy5.5, and Cy7 (Amersham Biosciences, Piscataway, N.J., USA, and others). FRET tandem fluorophores may also be used, such as PerCP-Cy5.5, PE-Cy5, PE-Cy5.5, PE-Cy7, PE-Texas Red, and APC-Cy7; also, PE-Alexa dyes (610, 647, 680) and APC-Alexa dyes. Biotin, or a derivative thereof, may also be used as a label on a detection oligonucleotide, and subsequently bound by a detectably labeled avidin/streptavidin derivative (e.g. phycoerythrin-conjugated streptavidin), or a detectably labeled anti-biotin antibody.
Digoxigenin may be incorporated as a label and subsequently bound by a detectably labeled anti-digoxigenin antibody (e.g. fluoresceinated anti-digoxigenin). An aminoallyl-dUTP residue may be incorporated into a detection oligonucleotide and subsequently coupled to an N-hydroxy succinimide (NHS) derivatized fluorescent dye, such as those listed supra. In general, any member of a conjugate pair may be incorporated into a detection oligonucleotide provided that a detectably labeled conjugate partner can be bound to permit detection. As used herein, the term antibody refers to an antibody molecule of any class, or any subfragment thereof, such as an Fab. Other suitable labels for detection oligonucleotides may include fluorescein (FAM), digoxigenin, dinitrophenol (DNP), dansyl, biotin, bromodeoxyuridine (BrdU), hexahistidine (6×His), phospho-amino acids (e.g. P-tyr, P-ser, P-thr), or any other suitable label. In one embodiment the following hapten/antibody pairs are used for detection, in which each of the antibodies is derivatized with a detectable label: biotin/α-biotin, digoxigenin/α-digoxigenin, dinitrophenol (DNP)/α-DNP, 5-Carboxyfluorescein (FAM)/α-FAM. As described in schemes below, probes may also be indirectly labeled, especially with a hapten that is then bound by a capture agent, e.g. as disclosed in Holtke et al, U.S. Pat. Nos. 5,344,757; 5,702,888; and 5,354,657; Huber et al, U.S. Pat. No. 5,198,537; Miyoshi, U.S. Pat. No. 4,849,336; Misiura and Gait, PCT publication WO 91/17160; and the like. Many different hapten-capture agent pairs are available for use with the invention. Exemplary haptens include biotin, des-biotin and other derivatives, dinitrophenol, dansyl, fluorescein, CY5, and other dyes, digoxigenin, and the like. For biotin, a capture agent may be avidin, streptavidin, or antibodies. Antibodies may be used as capture agents for the other haptens (many dye-antibody pairs being commercially available, e.g. Molecular Probes).

In one aspect, pools of probes are provided which preferably have from about 1 to about 3 bases, allowing for an even and optimized signal for different sequences at degenerate positions. In one embodiment, a concentration adjusted mix of 3-mer building blocks is used in the probe synthesis.

Probes may be prepared with nucleic acid tag tails instead of being directly labeled. Tails preferably do not interact with test DNA. These tails may be prepared from natural bases or modified bases such as isoC and isoG that pair only between themselves. If isoC and isoG nucleotides are used, the sequences may be separately synthesized with a 5′ amino-linker, which allows conjugation to a 5′ carboxy modified linker that is synthesized on to each tagged probe. This allows separately synthesized tag sequences to be combined with known probes while they are still attached to the column. In one embodiment, 21 tagged sequences are used in combination with 1024 known probes.

The tails may be separated from probes by 1-3 or more degenerate bases, abasic sites or other linkers. One approach to minimize interaction of tails and target DNA is to use sequences that are very infrequent in the target DNA. For example, CGCGATATCGCGATAT (SEQ. ID NO. 14) or CGATCGATCGAT (SEQ. ID NO. 15) is expected to be infrequent in mammalian genomes. One option is to use probes with tails pre-hybridized with unlabeled tags that would be denatured and may be washed away after ligation and before hybridization with labeled tags. Uracil may be used to generate degradable tails/tags and to remove them before running a new cycle instead of using temperature removal.

In one aspect, high-plex multiplex ligation assays use probes that are not labeled with fluorescent dyes, thus reducing background and assay costs. For example, for 8 colors, 4×8=32 different encoding tails may be prepared and 32 probes may be used as a pool in hybridization/ligation. In the decoding process, four cycles each with 8 tags are used. Thus, each color is used for 4 tags used in 4 decoding cycles. After each cycle, tags may be removed or dyes photobleached. The process requires that the last set of probes to be decoded stay hybridized through 4 decoding cycles.
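The 4×8 decoding scheme can be pictured as a lookup from a (decoding cycle, color) pair to one of 32 encoding tails. The following is a minimal sketch under that reading; the tail names and the integer coding of cycles and colors are hypothetical.

```python
def build_tail_code(n_colors=8, n_cycles=4):
    """Assign each encoding tail a unique (cycle, color) pair (hypothetical naming)."""
    return {f"tail{cycle * n_colors + color:02d}": (cycle, color)
            for cycle in range(n_cycles)
            for color in range(n_colors)}

def decode(observations, code):
    """Map (cycle, color) signals seen at one array spot back to encoding tails."""
    lookup = {pair: tail for tail, pair in code.items()}
    return [lookup[obs] for obs in observations if obs in lookup]

code = build_tail_code()
print(len(code))               # 32
print(decode([(2, 5)], code))  # ['tail21']
```

Each color appears once per cycle, so a single color serves 4 different tails across the 4 decoding cycles.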

In one embodiment, additional properties are included to provide the ability to distinguish different probes using the same color, for example Tm/stability, degradability by incorporated uracil bases and UDG enzyme, and chemically or photochemically cleavable bonds. A combination of two properties may be used, such as temperature stability measured directly or after cutting or removing a stabilizer, to provide 8 distinct tags for the same color; more than one cut type may be used to create 3 or more groups. To execute this, 4-8 or 6-12 exposures of the same color may be required, demanding low photo-bleaching conditions such as low intensity light illumination that may be detected by intensified CCDs (ICCDs). For example, if one property is melting temperature (Tm) and there are 4 tag-oligos or anchors or primers with distinct Tm, another set of 4 oligos can be prepared that has the first 4 probes connected to or interacting with a stabilizer that shifts the Tm of these 4 oligos above that of the most stable oligo in the first group without stabilizer. After resolving 4 oligos from the first group by consecutive melting off, the temperature may be reduced to the initial low level, the stabilizer may be cut or removed, and 4 tagged-oligos or anchors or primers can then be differentially melted using the same temperature points as for the first group.
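The two-round melting example can be sketched as a small simulation in which each tag is identified by the round and temperature step at which it is released. The temperatures and tag names below are hypothetical, and the model simply assumes that a stabilized oligo survives every step of the first round.

```python
# Hypothetical melting points (deg C) for four tag-oligos sharing one color.
MELT_STEPS = [45, 55, 65, 75]

def release_order(tags, melt_steps=MELT_STEPS):
    """tags maps name -> (tm, stabilized). Returns (round, temperature, name) events."""
    events = []
    for rnd, want_stabilized in ((1, False), (2, True)):
        # Round 1 melts off the unstabilized oligos; stabilized ones are assumed to
        # stay bound. Before round 2 the temperature is lowered and the stabilizer
        # is cut or removed, so the same temperature steps are reused.
        for temp in melt_steps:
            for name, (tm, stabilized) in sorted(tags.items()):
                if stabilized == want_stabilized and tm == temp:
                    events.append((rnd, temp, name))
    return events

tags = {f"tag{i + 1}": (MELT_STEPS[i % 4], i >= 4) for i in range(8)}
events = release_order(tags)
print(len({(rnd, temp) for rnd, temp, _ in events}))  # 8
```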

In one aspect, probe-probe hybrids are stabilized through ligation to another unlabeled oligonucleotide.

Methods of Sequencing Using Interspersed Adaptors

In one aspect, the invention includes a method of determining a nucleotide sequence of a target polynucleotide, the method comprising the steps of: (a) generating a plurality of interspersed adaptors within a target polynucleotide, each interspersed adaptor having at least one boundary with the target polynucleotide; and (b) determining the identity of at least one nucleotide adjacent to at least one boundary of at least two interspersed adaptors, thereby determining a nucleotide sequence of the target polynucleotide. As is more fully outlined below, the target sequence comprises a position for which sequence information is desired, generally referred to herein as the “detection position”. In general, sequence information (e.g. the identification of the nucleotide at a particular detection position) is desired for a plurality of detection positions. By “plurality” as used herein is meant at least two. In some cases, however, for example in single nucleotide polymorphism (SNP) detection, information may only be desired for a single detection position within any particular target sequence. As used herein, the base which basepairs with the detection position base in a hybrid is termed the “interrogation position”.

An important feature of the invention is the use of interspersed adaptors in target polynucleotide amplicons to acquire sequence information related to the target polynucleotides. A variety of sequencing methodologies may be used with interspersed adaptors, including, but not limited to, hybridization-based methods, such as disclosed in Drmanac, U.S. Pat. Nos. 6,864,052; 6,309,824; and 6,401,267; and Drmanac et al, U.S. patent publication 2005/0191656, and sequencing by synthesis methods, e.g. Nyren et al, U.S. Pat. No. 6,210,891; Ronaghi, U.S. Pat. No. 6,828,100; Ronaghi et al (1998), Science, 281: 363-365; Balasubramanian, U.S. Pat. No. 6,833,246; Quake, U.S. Pat. No. 6,911,345; Li et al, Proc. Natl. Acad. Sci., 100: 414-419 (2003); Smith et al, PCT publication WO 2006/074351; and ligation-based methods, e.g. Shendure et al (2005), Science, 309: 1728-1732; Macevicz, U.S. Pat. No. 6,306,597; which references are incorporated by reference.

In one aspect, a method of determining a nucleotide sequence of a target polynucleotide in accordance with the invention comprises the following steps: (a) generating a plurality of target concatemers from the target polynucleotide, each target concatemer comprising multiple copies of a fragment of the target polynucleotide and the plurality of target concatemers including a number of fragments that substantially covers the target polynucleotide; (b) forming a random array of target concatemers fixed to a surface at a density such that at least a majority of the target concatemers are optically resolvable; (c) identifying a sequence of at least a portion of each fragment in each target concatemer; and (d) reconstructing the nucleotide sequence of the target polynucleotide from the identities of the sequences of the portions of fragments of the concatemers. Usually, “substantially covers” means that the amount of DNA analyzed contains an equivalent of at least two copies of the target polynucleotide, or in another aspect, at least ten copies, or in another aspect, at least twenty copies, or in another aspect, at least 100 copies. Target polynucleotides may include DNA fragments, including genomic DNA fragments and cDNA fragments, and RNA fragments. Guidance for the step of reconstructing target polynucleotide sequences can be found in the following references, which are incorporated by reference: Lander et al, Genomics, 2: 231-239 (1988); Vingron et al, J. Mol. Biol., 235: 1-12 (1994); and like references.

In one aspect of the invention, a ligation-based sequencing method may be used as illustrated in FIGS. 3A-3E. Many different variations of this sequencing approach may be selected by one of ordinary skill in the art depending on factors such as the volume of sequencing desired, the type of labels employed, the type of target polynucleotide amplicons employed and how they are attached to a surface, the desired speed of sequencing operations, signal detection approaches, and the like. The variations shown in FIGS. 3A-3E are only exemplary.

In one aspect of the invention, a labeled probe is able to form a stable hybrid only after ligation to a pairing probe. The use of probe ligation improves data specificity over standard sequencing by hybridization methods. Probe ligation also has application in position specific base identification (e.g. DNA ends) or in a whole sequence scanning methodology (e.g. all internal overlapping sequences).

To identify sequences at a specific site in the unknown sequence, such as at the ends of the sequence, the labeled probes can be designed to allow ligation to an anchor probe. The longer anchor probe is hybridized to a known adaptor sequence that is adjacent to the end of the unknown sequence to be determined, e.g. the detection positions. Labeled probes can have various numbers of specific and degenerate bases. For example, 2 end bases can be determined with the probe BBNNNNNN (A=anchor, D=adaptor, G=genomic, B=probe defining bases, N=degenerate bases, *=label):

     AAAAAAAAA.BBNNNNNN*
DDDDDDDDDDDDDDGGGGGGGGGGGGGGGG

For such a probe structure there are 16 sequence-reading probes, each having 2 specific bases at the 5-prime end. If all 16 probes are tested, only one would efficiently ligate to the anchor probe and give a strong signal after removing probes that are not ligated to the anchor probe. Such a positive probe detects two bases at the end of the genomic DNA fragment, with a high specificity provided by the strong preference of T4 DNA ligase for complementary bases close to the ligation site.
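The count of 16 probes follows from enumerating every pair of specific bases. The sketch below is a simplification: each probe is written as the two bases it reads, so complementarity, strand orientation, and ligation chemistry are ignored, and the function names are hypothetical.

```python
from itertools import product

BASES = "ACGT"

def two_base_probes(n_degenerate=6):
    """All probes of the form BBNNNNNN: two specific bases, then degenerate positions."""
    return [a + b + "N" * n_degenerate for a, b in product(BASES, repeat=2)]

def positive_probe(probes, read_bases):
    """Return the single probe whose two specific bases equal read_bases."""
    matches = [p for p in probes if p[:2] == read_bases[:2]]
    assert len(matches) == 1
    return matches[0]

probes = two_base_probes()
print(len(probes))                   # 16
print(positive_probe(probes, "GA"))  # GANNNNNN
```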

In one aspect of the invention, a single stranded target polynucleotide is provided that contains a plurality of interspersed adaptors. In FIG. 3A, three interspersed adaptors (3002, 3004, and 3006) are shown, which may be part of an amplicon, such as a concatemer, comprising multiple copies of target polynucleotide (3000). Each interspersed adaptor has a region (e.g. 3008 and 3012) at each end that has a unique sequence (in this example six such unique sequences among three interspersed adaptors in all) designed as a binding site for a corresponding anchor probe, which is an oligonucleotide (which may or may not carry a label) to which a sequencing probe is ligated. Such end regions may have lengths in the range of from 6 to 14 nucleotides, and more usually, from 8 to 12 nucleotides. Interspersed adaptors optionally have central region (3010), which may contain additional elements such as recognition sites for various enzymes (when in double stranded form) or binding sites for capture oligonucleotides for immobilizing the target polynucleotide amplicons on a surface, and so on. In one aspect, a sequencing operation with interspersed adaptors (3002-3006) comprises six successive routines of hybridizing anchor probes to each of the different unique anchor probe binding sites. Each such routine comprises a cycle of hybridizing the anchor probe to its end site of its interspersed adaptor, combining with sequencing probes under conditions that permit hybridization of only perfectly matched probes, ligating perfectly matched sequencing probes to juxtaposed anchor probes, detecting ligated sequencing probes, identifying one or more bases adjacent to the anchor probe by the signal generated by the sequencing probe, and removing the sequencing probe and the anchor probe from the target polynucleotide amplicon.

A further embodiment includes creating a DNA circle of 300-3000 bases in length and inserting 2-3 adaptors on each side of the initial adaptor. In this way a mating pair of two 20-60 base long sequences, separated by 300-3000 bases, is generated. In addition to providing twice the level of sequence data, this method provides valuable mapping information. Mate pairs can bridge over repeats in de novo sequence assembly, and can also be used to accurately position mutations in repeats longer than 20-50 bases in genome re-sequencing. One, or a mating pair of two, ~20-50 base sequences can be complemented with probe hybridization or probe-probe ligation data. A partial set of ⅛ to 1/16 of all 5-mers, 6-mers, 7-mers or 8-mers may be scored to provide mapping information for 200-4000 base length fragments. In addition, all probes of a given length (such as all 6-mers) may be scored in 4-16 reaction chambers containing 4-16 sections of the total DNA array for a given genome. In each chamber ¼ to 1/16 of all probes may be scored. After mapping individual DNA fragments all probes can be compiled to provide 100 to 1000 reads per base in overlapped probes in overlapped fragments.

In one embodiment, the six successive routines are repeated from 1 to 4 times, preferably from 2 to 3 times, so that nucleotides at different distances from the interspersed adaptor may be identified. In another embodiment, the six successive routines are carried out once, but each cycle of anchor probe hybridization, sequencing probe hybridization, ligating, etc., is repeated from 1 to 4, or from 2 to 3 times. The former is illustrated in FIG. 3A, so that after anchor probe (3015) hybridizes to its binding site in interspersed adaptor (3002), labeled sequencing probes (3016) are added to the reaction mixture under conditions that permit ligation to anchor probe (3015) if a perfectly matched duplex is formed.

Sequencing probes may have a variety of different structures. Typically, they contain degenerate sequences and are either directly or indirectly labeled. In the example of FIG. 3A, sequencing probes are directly labeled with, e.g. fluorescent dyes F1, F2, F3, and F4, which generate signals that are mutually distinguishable, and fluorescent dyes G1, G2, G3, and G4, which also generate signals that are mutually distinguishable. In this example, since dyes of each set, i.e. F and G, are detected in different cycles, they may be the same dyes. When 8-mer sequencing probes are employed, a set of F-labeled probes for identifying a base immediately adjacent to an interspersed adaptor may have the following structure: 3′-F1-NNNNNNNAp, 3′-F2-NNNNNNNCp, 3′-F3-NNNNNNNGp, 3′-F4-NNNNNNNTp. Here it is assumed that sequence (3000) is in a 5′-3′ orientation from left to right; thus, the F-labeled probes must carry a phosphate group on their 5′ ends, as long as conventional ligase-mediated ligation reactions are used. Likewise, a corresponding set of G-labeled probes may have the following structure: 3′-ANNNNNNN-G1, 3′-CNNNNNNN-G2, 3′-GNNNNNNN-G3, 3′-TNNNNNNN-G4, and for ligation of these probes, their associated anchor probe must have a 5′-phosphate group. F-labeled probes in successive cycles may have the following structures: 3′-F1-NNNNNNANp, 3′-F2-NNNNNNCNp, 3′-F3-NNNNNNGNp, 3′-F4-NNNNNNTNp, and 3′-F1-NNNNNANNp, 3′-F2-NNNNNCNNp, 3′-F3-NNNNNGNNp, 3′-F4-NNNNNTNNp, and so on.
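The pattern of F-labeled probes across successive cycles can be generated mechanically: the specific base moves one position further from the 5′-phosphate end in each cycle. The sketch below only builds the probe names as strings; the mapping of dyes to bases is taken from the listed structures, and appending the phosphate marker "p" to every probe is an assumption made for uniformity.

```python
DYE_FOR_BASE = {"A": "F1", "C": "F2", "G": "F3", "T": "F4"}

def f_probe_set(offset, length=8):
    """Probe names for one cycle; offset 0 reads the base next to the 5' phosphate."""
    probes = []
    for base, dye in DYE_FOR_BASE.items():
        body = ["N"] * length
        body[length - 1 - offset] = base  # place the single specific base
        probes.append(f"3'-{dye}-{''.join(body)}p")
    return probes

print(f_probe_set(0))  # ["3'-F1-NNNNNNNAp", "3'-F2-NNNNNNNCp", "3'-F3-NNNNNNNGp", "3'-F4-NNNNNNNTp"]
print(f_probe_set(2)[0])  # 3'-F1-NNNNNANNp
```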

Returning to FIG. 3A, after ligated probe (3018) is identified, it is removed from the target polynucleotide amplicon (3020), and the next anchor probe (3022) is hybridized to its respective binding site. G-labeled sequencing probes are hybridized to the target polynucleotide so that those forming perfectly matched duplexes juxtaposed to the anchor probe are ligated and identified. This process continues for each anchor probe binding site until the last ligated probe (3028) is identified. The whole sequence of cycles is then repeated (3030) using F-labeled sequencing probes and G-labeled sequencing probes that are designed to identify a different base adjacent to their respective anchor probes.

FIG. 3B illustrates a variant of the method of FIG. 3A in which anchor probes are hybridized to their respective binding sites two-at-a-time. Any pair of anchor probes may be employed as long as one member of the pair binds to a 3′ binding site of an interspersed adaptor and the other member of the pair binds to a 5′ binding site of an interspersed adaptor. For directly labeled sequencing probes, as shown, this embodiment requires the use of eight distinguishable labels; that is, each of the labels F1-F4 and G1-G4 must be distinguishable from one another. In FIG. 3B, anchor probes (3100 and 3102) are hybridized to their respective binding sites in interspersed adaptor (3002), after which a set of sequencing probes (3104) is added under stringent hybridization conditions. Probes that form perfectly matched duplexes are ligated, unligated probes are washed away, after which the ligated probes are identified. Cycles of such hybridization, ligation and washing are repeated (3110) with sets of sequencing probes designed to identify bases at different sites adjacent to interspersed adaptor (3002). The process is then repeated for each interspersed adaptor.

FIG. 3C illustrates another variant of the embodiment of FIG. 3A, in which the cycles of sequencing probes for identifying bases at every site adjacent to an anchor probe are carried out to completion before an anchor probe for any other interspersed adaptor is used. Briefly, the steps within each dashed box (3200) are carried out for each anchor probe binding site, one at a time; thus, each dashed box corresponds to a different anchor probe binding site. Within each box, successive cycles are carried out comprising the steps of hybridizing an anchor probe, ligating sequencing probes, and identifying ligated sequencing probes.
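The difference between the orderings of FIG. 3A and FIG. 3C is only in which loop is outermost. The following sketch lists the (base position, anchor site) steps for each ordering; the site names and the number of base positions are hypothetical placeholders.

```python
def schedule_3a(sites, n_positions):
    """FIG. 3A style: visit every anchor site for one base position, then move on."""
    return [(pos, site) for pos in range(1, n_positions + 1) for site in sites]

def schedule_3c(sites, n_positions):
    """FIG. 3C style: finish every base position at one anchor site before the next site."""
    return [(pos, site) for site in sites for pos in range(1, n_positions + 1)]

sites = [f"site{i}" for i in range(1, 7)]  # six anchor probe binding sites
a = schedule_3a(sites, 3)
c = schedule_3c(sites, 3)
print(a[:3])  # [(1, 'site1'), (1, 'site2'), (1, 'site3')]
print(c[:3])  # [(1, 'site1'), (2, 'site1'), (3, 'site1')]
```

Both orderings perform the same set of steps; they differ only in how often the anchor probe must be exchanged.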

FIG. 3D illustrates an embodiment that employs encoded labels, similar to those used with the encoded adaptors disclosed by Albrecht et al, U.S. Pat. No. 6,013,445, which is incorporated herein by reference. The process is similar to that described in FIG. 3C, except that instead of directly labeled sequencing probes, such probes are indirectly labeled with oligonucleotide tags. By using such tags, the number of ligation steps can be reduced, since each sequencing probe mixture may contain sequences to identify many more than four bases. For example, non-cross-hybridizing oligonucleotide tags may be selected that correspond to each of sixteen pairs of bases, so that after ligation, ligated sequencing probes may be interrogated with sets of labeled anti-tags until each two-base sequence is identified. Thus, the sequence of a target polynucleotide adjacent to an anchor probe may be identified two-at-a-time, or three-at-a-time, or more, using encoded sequencing probes. Going to FIG. 3D, anchor probe (352) is hybridized to anchor binding site (381), after which encoded sequencing probes are added under conditions that permit only perfectly complementary sequencing probes (354) to be ligated to anchor probes (352). After such ligation and washing away of un-ligated sequencing probes, labeled anti-tags (358) are successively hybridized to the oligonucleotide tags of the sequencing probes under stringent conditions so that only labeled anti-tags forming perfectly matched duplexes are detected. A variety of different labeling schemes may be used with the anti-tags. A single label may be used for all anti-tags and each anti-tag may be separately hybridized to the encoded sequencing tags. Alternatively, sets of anti-tags may be employed to reduce the number of hybridizations and washings that must be carried out.
For example, where each sequencing probe identifies two bases, two sets of four anti-tags each may be applied, wherein each tag in a given set carries a distinct label according to the identity of one of the two bases identified by the sequencing probe. Likewise, if a sequencing probe identifies three bases, then three sets of four anti-tags each may be used for decoding. Such cycles of decoding may be carried out for each interspersed adaptor, after which additional cycles may be carried out using sequencing probes that identify bases at different sites.
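The two-base decoding scheme can be modeled as two decoding rounds, each reporting one of four labels, where the label in round i names base i of the pair encoded by the tag. This is a minimal sketch under that reading; the tag names and the use of the base letter itself as the label are hypothetical.

```python
from itertools import product

BASES = "ACGT"
# Sixteen hypothetical oligonucleotide tags, one per two-base sequence.
TAG_FOR_PAIR = {a + b: f"tag{i:02d}" for i, (a, b) in enumerate(product(BASES, repeat=2))}

def antitag_labels(position):
    """Label reported for each tag in one decoding round (one of four labels)."""
    return {tag: pair[position] for pair, tag in TAG_FOR_PAIR.items()}

def decode(tag):
    """Combine the labels from the two decoding rounds into the two-base call."""
    return antitag_labels(0)[tag] + antitag_labels(1)[tag]

print(len(TAG_FOR_PAIR))           # 16
print(decode(TAG_FOR_PAIR["GT"]))  # GT
```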

FIG. 3E illustrates an embodiment similar to that described in FIG. 3B, except that here encoded sequencing probes are employed. Thus, two anchor probes are hybridized to a target polynucleotide at a time and the corresponding sequencing probes are identified by decoding with labeled anti-tags. As shown, anchor probes (316 and 318) are hybridized to their respective binding sites on interspersed adaptor (3002), after which two sets of encoded sequencing probes (327) are added under conditions that permit only such probes forming perfectly matched duplexes to be ligated. After removal of unligated probes, the oligonucleotide tags of the ligated probes are decoded with labeled anti-tags. As above, a variety of schemes are available for decoding the ligated sequencing probes.

In another aspect, a sequencing method for use with the invention for determining sequences in a plurality of DNA or RNA fragments comprises the following steps: (a) generating a plurality of polynucleotide molecules each comprising a concatemer of a DNA or RNA fragment; (b) forming a random array of polynucleotide molecules fixed to a surface at a density such that at least a majority of the target concatemers are optically resolvable; and (c) identifying a sequence of at least a portion of each DNA or RNA fragment in resolvable polynucleotides using at least one chemical reaction of an optically detectable reactant. In one embodiment, such optically detectable reactant is an oligonucleotide. In another embodiment, such optically detectable reactant is a nucleoside triphosphate, e.g. a fluorescently labeled nucleoside triphosphate that may be used to extend an oligonucleotide hybridized to a concatemer. In another embodiment, such optically detectable reactant is an oligonucleotide formed by ligating a first and second oligonucleotide to form adjacent duplexes on a concatemer. In another embodiment, such chemical reaction is synthesis of DNA or RNA, e.g. by extending a primer hybridized to a concatemer.

In one aspect, parallel sequencing of concatemers of target polynucleotides on a random array is accomplished by combinatorial SBH (cSBH), as disclosed by Drmanac in the above-cited patents. In one aspect, first and second sets of oligonucleotide probes are provided, wherein each set has member probes that comprise oligonucleotides having every possible sequence for the defined length of probes in the set. For example, if a set contains probes of length six, then it contains 4096 (=4⁶) probes. In another aspect, first and second sets of oligonucleotide probes comprise probes having selected nucleotide sequences designed to detect selected sets of target polynucleotides. Sequences are determined by hybridizing one probe or pool of probes, hybridizing a second probe or a second pool of probes, ligating probes that form perfectly matched duplexes on their target sequences, identifying those probes that are ligated to obtain sequence information about the target sequence, repeating the steps until all the probes or pools of probes have been hybridized, and determining the nucleotide sequence of the target from the sequence information accumulated during the hybridization and identification steps.

For sequencing operations, in some embodiments, the sets may be divided into subsets that are used together in pools, as disclosed in U.S. Pat. No. 6,864,052. Probes from the first and second sets may be hybridized to target sequences either together or in sequence, either as entire sets or as subsets, or pools. In one aspect, lengths of the probes in the first or second sets are in the range of from 5 to 10 nucleotides, and in another aspect, in the range of from 5 to 7 nucleotides, so that when ligated they form ligation products with a length in the range of from 10 to 20, and from 10 to 14, respectively.

In another aspect, using such techniques, the sequence identity of each attached DNA concatemer may be determined by a “signature” approach. About 50 to 100 or possibly 200 probes are used such that about 25-50% or in some applications 10-30% of attached concatemers will have a full match sequence for each probe. This type of data allows each amplified DNA fragment within a concatemer to be mapped to the reference sequence. For example, by such a process one can score 64 4-mers (i.e. 25% of all possible 256 4-mers) using 16 hybridization/stripoff cycles in a 4-color labeling scheme. On a 60-70 base fragment amplified in a concatemer, about 16 of 64 probes will be positive since there are 64 possible 4-mers present in a 64 base long sequence (i.e. one quarter of all possible 4-mers). Unrelated 60-70 base fragments will have a very different set of about 16 positive decoding probes. A combination of 16 probes out of 64 probes has a random chance of occurrence in 1 of every one billion fragments, which practically provides a unique signature for that concatemer. Scoring 80 probes in 20 cycles and generating 20 positive probes creates a signature even more likely to be unique: occurrence by chance is 1 in a billion billion. Previously, a “signature” approach was used to select novel genes from cDNA libraries. An implementation of a signature approach is to sort obtained intensities of all tested probes and select up to a predefined (expected) number of probes that satisfy the positive probe threshold. These probes will be mapped to sequences of all DNA fragments (a sliding window of a longer reference sequence may be used) expected to be present in the array. The sequence that has all or a statistically sufficient number of the selected positive probes is assigned as the sequence of the DNA fragment in the given concatemer. In another approach, an expected signal can be defined for all used probes using their pre-measured full match and mismatch hybridization/ligation efficiency. In this case a measure similar to the correlation factor can be calculated.
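The sliding-window mapping step described above can be illustrated with a minimal Python sketch. The function names (`kmers_in`, `map_signature`), the `min_fraction` parameter, and the toy reference are assumptions made for illustration only; the sketch treats probes as simple present/absent calls rather than measured intensities and is not the implementation of the invention.

```python
def kmers_in(seq, k=4):
    """Return the set of all k-mers present in seq."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def map_signature(positive_probes, reference, window=64, min_fraction=1.0):
    """Slide a window over the reference and return every start position
    whose k-mer content contains at least min_fraction of the positive
    probes (hypothetical stand-in for a 'statistically sufficient number').
    positive_probes must be non-empty and of equal length."""
    positives = set(positive_probes)
    k = len(next(iter(positives)))
    hits = []
    for start in range(len(reference) - window + 1):
        present = kmers_in(reference[start:start + window], k)
        shared = len(positives & present)
        if shared >= min_fraction * len(positives):
            hits.append(start)
    return hits


# Toy example: an 8-base fragment embedded in a low-complexity reference.
reference = "A" * 20 + "ACGTTGCA" + "C" * 20
hits = map_signature(["ACGT", "GTTG", "TGCA"], reference, window=8)
```

Lowering `min_fraction` below 1.0 would tolerate some missed probes, at the cost of more candidate positions.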

A preferred way to score 4-mers is to ligate pairs of probes, for example: N₍₅₋₇₎BBB with BN₍₇₋₉₎, where B is the defined base and N is a degenerate base. For generating signatures on longer DNA concatemer probes, more unique bases will be used. For example, a 25% positive rate in a fragment 1000 bases in length would be achieved by N₍₄₋₆₎BBBB and BBN₍₆₋₈₎. Note that longer fragments need the same number of about 60-80 probes (15-20 ligation cycles using 4 colors).

In one embodiment all probes of a given length (e.g. 4096 N₂₋₄BBBBBBN₂₋₄) or all ligation pairs may be used to determine the complete sequence of the DNA in a concatemer. For example, 1024 combinations of N₍₅₋₇₎B₃ and BBN₍₆₋₈₎ may be scored (256 cycles if 4 colors are used) to determine the sequence of DNA fragments of up to about 250 bases, preferably up to about 100 bases.

Decoding or sequencing probes with large numbers of Ns may be prepared from multiple syntheses of subsets of sequences at degenerate bases to minimize differences in efficiency. Each subset is added to the mix at a proper concentration. Also, some subsets may have more degenerate positions than others. For example, each of 64 probes from the set N₍₅₋₇₎BBB may be prepared in 4 different syntheses. The first is a regular synthesis in which all 5-7 bases are fully degenerate; the second is N₀₋₃(A,T)₅BBB; the third is N₀₋₂(A,T)(G,C)(A,T)(G,C)(A,T)BBB; and the fourth is N₀₋₂(G,C)(A,T)(G,C)(A,T)(G,C)BBB.

Oligonucleotide preparations from the three specific syntheses are added to the regular synthesis in experimentally determined amounts to increase hybrid generation with target sequences that have, in front of the BBB sequence, an AT-rich sequence (e.g. AATAT) or an alternating (A or T) and (G or C) sequence (e.g. ACAGT or GAGAC). These sequences are expected to be less efficient in forming a hybrid. All 1024 target sequences can be tested for the efficiency to form a hybrid with N₀₋₃NNNNNBBB probes, and those types that give the weakest binding may be prepared in about 1-10 additional syntheses and added to the basic probe preparation.

In another embodiment, a smaller number of probes is used for a small number of distinct samples; for example, 5-7 positive out of 20 probes (5 cycles using 4 colors) has the capacity to distinguish about 10-100 thousand distinct fragments.

In one aspect, 8-20-mer RCR products are decoded by providing arrays formed as random distributions of unique 8 to 20 base recognition sequences in the form of DNA concatemers. The probes are decoded to determine the sequence of the 8-20 base probe region using a number of possible methods. In an exemplary method, one half of the sequence is determined by utilizing the hybridization specificity of short probes and the ligation specificity of fully matched hybrids. Six to ten bases adjacent to the 12-mer are predefined and act as a support for a 6-mer to 10-mer oligonucleotide. This short 6-mer will ligate at its 3-prime end to one of 4 labeled 6-mers to 10-mers. These decoding probes consist of a pool of 4 oligonucleotides in which each oligonucleotide consists of 4-9 degenerate bases and 1 defined base. This oligonucleotide will also be labeled with one of four fluorescent labels. Each of the 4 possible bases A, C, G, or T will therefore be represented by a fluorescent dye. For example these 5 groups of 4 oligonucleotides and one universal oligonucleotide (Us) can be used in the ligation assays to sequence the first 5 bases of 12-mers: δ=each of 4 bases associated with a specific dye or tag at the end:

UUUUUUUU.BNNNNNNN*
UUUUUUUU.NBNNNNNN
UUUUUUUU.NNBNNNNN
UUUUUUUU.NNNBNNNN
UUUUUUUU.NNNNBNNN

Six or more bases can be sequenced with additional probe pools. To improve discrimination at positions near the center of the 12-mer, the 6-mer oligonucleotide may be positioned further into the 12-mer sequence. This will necessitate the incorporation of degenerate bases into the 3′ end of the non-labeled oligonucleotide to accommodate the shift. This is an example of decoding probes for positions 6 and 7 in the 12-mer:

UUUUUUNN.NNNBNNNN
UUUUUUNN.NNNNBNNN

In a similar way the 6 bases from the right side of the 12-mer can be decoded by using a fixed oligonucleotide and 5-prime labeled probes. In the above described system 6 cycles are required to define 6 bases of one side of the 12-mer. With redundant cycle analysis of bases distant to the ligation site this may increase to 7 or 8 cycles. Complete sequencing of the 12-mer can thus be accomplished with 12-16 cycles of ligation.

In one embodiment, the invention provides a method for partial or complete sequencing of arrayed DNA by combining two distinct types of libraries of detector probes. In this approach one set has probes of the general type N₃₋₈B₄₋₆ (anchors) that are ligated with the first 2 or 3 or 4 probes/probe pools from the set BN₆₋₈, NBN₅₋₇, N₂BN₄₋₆, and N₃BN₃₋₅. In an exemplary method, 1-4 4-mers or more are hybridized to 5-mer anchors to obtain 1 or 2 anchors per DNA for about 70%-80% of the molecules. In one embodiment, the positive anchor is determined by mixing specific probes with distinct hybrid stability (possibly with a different number of Ns in addition). Anchors may also be tagged to determine which anchor from the pool is hybridized to a spot. Tags, as additional DNA segments, may be used for adjustable displacement as a detection method. For example, EEEEEEEENNNAAAAA and FFFFFFFFNNNCCCCC probes can, after hybridization or hybridization and ligation, be differentially removed with two corresponding displacers: EEEEEEEENNNNN and FFFFFFFFNNNNNNNN, where the second is more efficient. In another embodiment, separate cycles may be used to determine which anchor is positive. For this purpose anchors labeled or tagged with multiple colors may be ligated to unlabeled N7-N10 supporter oligonucleotides.

The BNNNNNNNN probe is then hybridized with 4 colors corresponding to 4 bases. A discriminative wash or displacement by a complement to the tag is used to read which of two scored bases is associated with an anchor if two anchors are positive in one DNA. Thus, two 7-10 base sequences can be scored at the same time. 2-4 cycles can be used to extend to a 4-6 base anchor for an additional 2-4 base run of 16 different anchors per array (32-64 physical cycles if 4 colors are used) to determine about 16 possible 8-mers (˜100 bases total) per fragment. This is sufficient to map the fragment to the reference: the probability that a 100-mer will have a set of 10 8-mers is less than 1 in a trillion trillion (10e⁻²⁸). By combining data from different anchors scored in parallel on the same fragment in another array, the complete sequence of that fragment, and by extension of entire genomes, may be generated from overlapping 7-10-mers.

In one aspect, the invention provides methods for tagging probes with DNA tags for larger multiplex of decoding or sequence determination probes. Instead of a direct label, the probes can be tagged with different oligonucleotide sequences made of natural bases or new synthetic bases (such as isoG and isoC). Tags can be designed to have very precise binding efficiency with their anti-tags using different oligonucleotide lengths (about 6-24 bases) and/or sequence including GC content. For example 4 different tags may be designed that can be recognized with specific anti-tags in 4 consecutive cycles or in one hybridization cycle followed by a discriminative wash. In the discriminative wash, the initial signal is reduced to 95-99%, 30-40%, 10-20% and 0-5% for each tag, respectively. In this case by obtaining two images 4 measurements are obtained, assuming that probes with different tags will rarely hybridize to the same dot. Another benefit of having many different tags even if they are consecutively decoded (or 2-16 at a time labeled with 2-16 distinct colors) is the ability to use a large number of individually recognizable probes in one assay reaction. This way a 4-64 times longer assay time (that may provide more specific or stronger signal) may be affordable if the probes are decoded in short incubation and removal reactions.
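The two-image readout with a discriminative wash can be sketched as a simple classification on the fraction of signal retained. In this minimal Python sketch, the tag names, the `RETAINED` table, and the nearest-midpoint rule are illustrative assumptions; only the four retention ranges come from the text above.

```python
# Fraction of the initial signal expected to remain after the
# discriminative wash, per tag (ranges taken from the text above;
# the tag names are hypothetical).
RETAINED = {
    "tag1": (0.95, 0.99),
    "tag2": (0.30, 0.40),
    "tag3": (0.10, 0.20),
    "tag4": (0.00, 0.05),
}


def identify_tag(signal_before, signal_after):
    """Assign a spot to the tag whose expected retention range midpoint
    is closest to the observed retained fraction (assumed decision rule)."""
    if signal_before <= 0:
        raise ValueError("spot has no signal before the wash")
    ratio = signal_after / signal_before
    return min(RETAINED, key=lambda tag: abs(ratio - sum(RETAINED[tag]) / 2))


# Two images of the same four spots: before and after the wash.
calls = [identify_tag(1000, after) for after in (970, 350, 150, 20)]
```

A real decoder would presumably also reject spots whose ratio falls far from every range, for example when two differently tagged probes hybridize to the same spot.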

The decoding process requires the use of 48-96 or more decoding probes. These probes will be further combined into 12-24 or more pools by encoding them with four fluorophores, each having different emission spectra. Using a 20× objective, each 6 mm×6 mm array may require roughly 30 images for full coverage by using a 10-megapixel camera. Each 1 micrometer array area is read by about 8 pixels. Each image can be acquired in 250 milliseconds: 150 ms for exposure and 100 ms to move the stage. Using this fast acquisition it will take ˜7.5 seconds to image each array, or 12 minutes to image the complete set of 96 arrays on each substrate.

In one embodiment of an imaging system, a high image acquisition rate is achieved by using four ten-megapixel cameras, each imaging the emission spectra of a different fluorophore. The cameras are coupled to the microscope through a series of dichroic beam splitters. The autofocus routine, which takes extra time, runs only if an acquired image is out of focus. It will then store the Z axis position information to be used upon return to that section of that array during the next imaging cycle. By mapping the autofocus position for each location on the substrate we will drastically reduce the time required for image acquisition.

Typically, each array requires about 12-24 cycles to decode. Each cycle consists of a hybridization, wash, array imaging, and strip-off step. These steps, in their respective orders, may take for the above example 5, 2, 12, and 5 minutes each, for a total of 24 minutes each cycle, or roughly 5-10 hours for each array, if the operations are performed linearly. The time to decode each array can be reduced by a factor of two by allowing the system to image constantly. To accomplish this, the imaging of two separate substrates on each microscope is staggered, i.e., while one substrate is being reacted, the other substrate is imaged.
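The timing figures in this example follow from simple arithmetic, which the minimal Python sketch below reproduces. The function names and the treatment of staggering as an exact halving are assumptions for illustration; the step durations, image counts, and acquisition times are the ones quoted in the text.

```python
def array_imaging_seconds(images=30, exposure_ms=150, stage_move_ms=100):
    """Time to image one array: images x (exposure + stage move)."""
    return images * (exposure_ms + stage_move_ms) / 1000


def substrate_imaging_minutes(arrays=96):
    """Time to image every array on one substrate."""
    return arrays * array_imaging_seconds() / 60


# Per-cycle step durations in minutes, as quoted in the text.
STEP_MINUTES = {"hybridization": 5, "wash": 2, "imaging": 12, "strip_off": 5}


def decode_hours(cycles, staggered=False):
    """Total decoding time; staggering two substrates on one microscope
    is modeled (as an assumption) as halving the time."""
    total_minutes = cycles * sum(STEP_MINUTES.values())
    if staggered:
        total_minutes /= 2
    return total_minutes / 60


per_array_s = array_imaging_seconds()        # 7.5 seconds
per_substrate_min = substrate_imaging_minutes()  # 12 minutes
low, high = decode_hours(12), decode_hours(24)   # 4.8 and 9.6 hours
```

The 12-minute imaging step per cycle is the substrate-level figure (96 arrays at about 7.5 seconds each), which is why it dominates the 24-minute cycle.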

An exemplary decoding cycle using cSBH includes the following steps: (i) set temperature of array to hybridization temperature (usually in the range 5-25° C.); (ii) use robot pipetter to pre mix a small amount of decoding probe with the appropriate amount of hybridization buffer; (iii) pipette mixed reagents into hybridization chamber; (iv) hybridize for predetermined time; (v) drain reagents from chamber using pump (syringe or other); (vi) add a buffer to wash mismatches of non-hybrids; (vii) adjust chamber temperature to appropriate wash temp (about 10-40° C.); (viii) drain chamber; (ix) add more wash buffer if needed to improve imaging; (x) image each array, preferably with a mid power (20×) microscope objective optically coupled to a high pixel count high sensitivity CCD camera, or cameras; plate stage moves chambers (or perhaps flow-cells with input funnels) over object, or objective-optics assembly moves under chamber; certain optical arrangements, using dichroic mirrors/beam-splitters can be employed to collect multi-spectral images simultaneously, thus decreasing image acquisition time; arrays can be imaged in sections or whole, depending on array/image size/pixel density; sections can be assembled by aligning images using statistically significant empty regions pre-coded onto substrate (during active site creation) or can be made using a multi step nano-printing technique, for example sites (grid of activated sites) can be printed using specific capture probe, leaving empty regions in the grid; then print a different pattern or capture probe in that region using separate print head; (xi) drain chamber and replace with probe strip buffer (or use the buffer already loaded) then heat chamber to probe strip off temperature (60-90° C.); high pH buffer may be used in the strip-off step to reduce stripoff temperature; wait for the specified time; (xii) remove buffer; (xiii) start next cycle with next decoding probe pool in set.

Combinatorial Probe Ligation for Sequencing by Hybridization

In a preferred aspect of the invention, information on the sequence of a target polynucleotide is obtained through a sequencing by hybridization method which utilizes combinatorial probe ligation. In this aspect of the invention, two complete, universal sets of short probes are exposed to target DNA in the presence of DNA ligase (R. Drmanac, U.S. Pat. No. 6,401,267, 2002). Typically one probe set is attached to a solid support such as a glass slide, while the other set, labeled with fluorophores, is mobile in solution. When attached and labeled probes hybridize to the target at precisely adjacent positions, they are ligated, generating a long, labeled probe that is covalently linked to the slide surface. A positive signal at a given position indicates the presence of a sequence within the target that complements the two probes that were combined to generate the signal.

In a preferred embodiment a universal sequencing chip, such as the HyChip™ slide developed by Complete Genomics, is used in the combinatorial sequencing by hybridization methods of the present invention. In one embodiment, each HyChip™ comprises a regular microscope glass slide containing eight replica arrays of attached 6-mers, allowing analysis using a complete set of over four million 11-mer probes per sample using 4096 arrayed 6-mers and 1024 labeled 5-mer probes. In a preferred embodiment, the sequencing method utilizing the HyChip™ system is used to sequence mixtures of separate, unrelated DNA fragments.
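The probe counts quoted here follow from the size of a universal probe set, 4 raised to the probe length. The short Python sketch below checks the arithmetic; the function name `universal_set_size` is an illustrative assumption.

```python
def universal_set_size(length):
    """Number of probes in a universal set: every sequence of the given
    length over the four bases A, C, G, T."""
    return 4 ** length


attached_6mers = universal_set_size(6)   # arrayed probes
labeled_5mers = universal_set_size(5)    # probes in solution
# Each attached 6-mer can ligate to each labeled 5-mer, so the ligation
# products cover every possible 11-mer.
ligated_11mers = attached_6mers * labeled_5mers
```

The product equals `universal_set_size(11)`, i.e. 4,194,304, which is the "over four million" 11-mer probes referred to in the text.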

DNA samples for use with the sequencing methods of the present invention can be prepared by PCR.

In a preferred aspect, the invention provides an array of millions of individual polynucleotide molecules, randomly disposed on an optically clear surface at a density of about one spot per square micron. These polynucleotide molecules serve as templates for hybridization and ligation of fluorescent-tagged probe pools. In one embodiment, probe pools are mixed with DNA ligase and presented to the random array. When probes hybridize to adjacent sites on a target fragment, they are ligated together, forming a stable hybrid. A sensitive megapixel CCD camera with advanced optics can be used to simultaneously detect millions of these individual hybridization/ligation events on the entire array. Once signals from the first pool pair are detected, the probes are removed and successive ligation cycles are used to test different probe combinations. In preferred aspects of the invention, a 3.2×3.2 mm array will have the capacity to hold 10 million fragments, or approximately 1-10 billion DNA bases.

Combinatorial Labeling Using Labeled Tags

In one aspect, a single hybridization/ligation cycle can be used to test all 16 possible probes by using 16 fluorescent colors. Such a test may also be accomplished using methodologies to create fluorescent signatures from fewer fluorescent colors. In fluorescent in-situ hybridization (FISH) chromosomal “painting”, combinations of fluorescent probes can be utilized to create new fluorescent signatures for that combination of probes. For example, combinations of two probes from a set of 4 can create 10 possible signature fluorescent signals, 5 can create 15, 6 can create 21 and so on. Therefore, in a single hybridization cycle it would be possible to distinguish which one of 16 probes was hybridized to the anchor probe.
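The signature counts of 10, 15, and 21 correspond to counting each single color plus each unordered pair of colors. The minimal Python sketch below enumerates them; the function names and the dye names are assumptions for illustration, and the interpretation "singles plus pairs" is inferred from the numbers in the text.

```python
from itertools import combinations


def color_signatures(colors):
    """All signatures made of one color or an unordered pair of colors."""
    singles = [(c,) for c in colors]
    pairs = list(combinations(colors, 2))
    return singles + pairs


def min_colors_for(n_probes):
    """Smallest number of colors whose single and pair signatures can
    label n_probes distinct probes."""
    n = 1
    while n * (n + 1) // 2 < n_probes:
        n += 1
    return n


counts = {n: len(color_signatures(range(n))) for n in (4, 5, 6)}
dinucleotide_colors = min_colors_for(16)   # all 16 probes labeled
```

Under this counting, 6 colors are needed if all 16 probes carry a signature, while 5 colors suffice for 15 labeled probes.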

Alternatively, if one of the BBNNNNNN probes was left unlabeled (and inferred by lack of signals for all other probes), 5 colors would be sufficient to label all of the remaining 15 dinucleotides. Four colors may be used to label 4 probes that read a single base, or 8 probes (out of all 16 needed probes) to read two bases. In this latter case all 16 probes could be scored in two cycles (see below). Thus, a 5 or 6 color system may be much easier to implement than the 16 colors required by non-combinatorial labeling.

For efficient combinatorial labeling, 2-mer probes may be prepared with a tail sequence containing tag binding sites. Tail sequences can be combinatorially designed for binding 2 out of 5 (or 6) labeled oligonucleotide tags, or 16 tags with one or two fluorescent dyes can be synthesized for each of the 16 tails. Use of labeled tags instead of directly labeled probes has additional advantages. Testing all 16 BBNNNNNN probes would require about 1024-fold more probe (assuming low discrimination at positions further from the ligation site) than for a single probe. For example, to have the probe AGCTANNN at 1 μM concentration within a probe mix of BBNNNNNN, the mix would need to be at 1024 μM. Since labeled probes are much costlier to synthesize than unlabeled probes, the unlabeled probes could be detected with a tail sequence, with the labeled tag probe used at a low concentration since it may be perfectly complementary to the tail sequence. Additionally, using unlabeled tailed probes would be advantageous in maintaining a lower background because the fluorophore would be at low concentration. An overall 100-fold cost reduction is expected by using 6 labeled tags (without degenerate bases) instead of the equivalent 1024 labeled probes.

Tags also provide an efficient option to use only 4 colors to read all 16 dinucleotides in a single ligation reaction. In such an embodiment, two sets of 4 distinct tags may be designed for decoding 8 2-mers each. All 16 2-mers can be decoded in two decoding cycles. This strategy can be expanded to use the same 4 colors for reading 2 bases on each end of an adaptor. In this case, 4 groups of 4 tags may be used in 4 decoding steps for each ligation cycle that reads 4 bases. Performing multiple decoding cycles instead of multiple ligation cycles is less expensive (less enzyme is used), and ligation cycles may be extended for a longer time, with lower probe concentration, to reduce mismatch ligation.

Tags may also be designed to minimize interference with the analyzed DNA, for example by using isoC and isoG base pairs that do not pair with natural bases. Another option is to use standard DNA chemistry but design sequences that are very infrequent in the human genome. Yet another option is to use a probe with tails pre-hybridized with unlabeled tags that would be removed after ligation and before hybridization with labeled tags.

Expanding the Number of Bases that can be Decoded

Reading further than 2 nucleotides from the anchor probe can, in some aspects of the invention, utilize additional rounds of probe-anchor ligation, with removal of the anchor/label probe from the target prior to the initiation of the next cycle. The ligated probe-anchor can be removed using a number of methods known in the art, including by heating, or by temperature or light cleavable bonds in the anchor probe, such that the anchor is fragmented and destabilized in the heating step. Since the bases to be sequenced are now 3 and 4 bases from the adaptor, modifications need to be made to the anchor probe or labeled probe. In the case of the anchor probe, it can in one embodiment of the invention be prepared with 2 additional degenerate bases at the ligation end. To ensure that the efficiency of the subsequent ligation is maintained, in one embodiment the anchor is constructed through ligation of two shorter oligonucleotides on the template DNA. Alternatively, the sequencing probe can be prepared with two degenerate bases at the ligating end in the manner of: NNBBNNNN-tag. In another aspect of the invention, the assay may be designed to read an additional 2 bases using 16 anchor probes.

The specificity of probe-anchor ligation is very high because only 2-4 bases around the ligation site are tested. The average discrimination for these bases is 50-100 fold. Some mismatches such as GT are considerably stronger, having discriminations of only 5-20 fold. In an embodiment of the invention, software is provided that can take the differences in discrimination of certain mismatches into account.

In an aspect of the invention, each probe, anchor and tag is optimized (for example, by concentration, number of degenerate bases, sequence and length of tags) to maximally equalize full match signals. Overlapped and shifted pairs of probes and anchors may be designed in one embodiment of the invention to read each base 2-3 times to increase base calling accuracy.

The insertion of additional internal adaptors with anchor regions at precise short distances expands the sequencing capability of bases at defined positions in the genomic fragment. For example, having the original plus 2 additional adaptors spaced 8 bases apart allows the determination of 20 continuous bases in 10 cycles, by reading 4 bases from 5 consecutive adaptor ends.

Initial    First    Adaptor 2  2nd      Adaptor 3  Additional
adaptor    8 bases             8 bases             ˜200 bases
DDDDDDDDDD GGGGGGGG DDDDDDDDDD GGGGGGGG DDDDDDDDDD GGGG GGGGGGG
  AAAAAAA. BB NNNNNN-tail   AAAAAAA. BB NNNNNN-tail   AAAAAAA. BB NNNNNN-tail
  AAAAAAA.NN BB NNNN-tail   AAAAAAA.NN BB NNNN-tail   AAAAAAA.NN BB NNNN-tail
          tail-NNNN BB NN.AAAAAAA   tail-NNNN BB NN.AAAAAAA
          tail-NNNNNN BB.AAAAAAA    tail-NNNNNN BB.AAAAAAA

D = adaptor, G = genomic DNA, A = anchor, B = specified probe base, N = degenerate probe base.

Multiple adaptors also provide the opportunity to further increase the reading capacity and to be able to determine more than 2 bases per cycle. In one embodiment, 4-12 bases are identified per cycle. In another embodiment, 4-8 bases are identified per cycle. In yet another embodiment, 12-16 or more bases are determined per cycle.

In one embodiment, 3 adaptors are positioned 12 bases apart, allowing for 30 bases of continuous sequence to be obtained by reading 6 bases at each of 5 ends. In another embodiment, a total of 4 adaptors and reading 16 bases between two adaptors generates a continuous sequence of 56 bases in 28 cycles. In other embodiments, two (initial plus one additional) adaptors separated by 16 bases to read 24 bases are used.

In one embodiment, multiple bases are identified per cycle by simultaneously hybridizing probes to multiple or all anchor sites, with the same set of 16 dinucleotide probes used at each anchor site, but reading each anchor site independently. In one embodiment, this simultaneous probe ligation is achieved by designing anchors with different melting temperatures and measuring color intensities at multiple predefined temperatures.

In another embodiment, multiple adaptors are used for cyclical primer extension to provide longer reads with fewer cycles from each individual primer.

In one embodiment, mapping information can be obtained by scoring a sufficient number of short sequences distributed over the entire DNA fragment without any positional information or from a smaller number of short sequences at precise locations. A variant of this process is referred to as “hybridization signature” where expected and observed intensities are compared. In another embodiment, the short sequences may be designed to provide localized (intermittent or continuous) sequence information. Three examples of such short sequences may be represented schematically as follows:

a. (X)aBB(X)bBB(X)cBB(X)dBB(X)eBB(X)f . . .
b1. BBX6BBX4BBX6BBX4BBXa . . .
b2. B16Xa

The number of oligonucleotide sequences needed for complete mapping information depends on the size of the target sequence, the size of the DNA fragments used and on the complexity of the source DNA. For human and other similarly complex genomes about 5 positive 8-mers or 10 positive 6-mers may be sufficient for 100 base DNA fragments. To score one positive 8-mer in 2 cycles, about 10 cycles total can be used by employing 3-fold more cycles than anchor sequencing. In one embodiment, this process does not utilize insertion of two anchors and may be done without enzyme using direct hybridization. In such an embodiment, 3000 8-mers can be utilized.

In one embodiment the same set of probes may be used in different group combinations (combinatorial pooling) to decode which probe from the pool of probes with identical labels is positive. For example, all 3000 probes labeled with 300 distinct labels may be scored in two reactions by having 5 probes labeled with the same probe combination. In addition to 6 true positives, some other 30 or more pool-related false positives will be found in these two reactions. By performing another two hybridization cycles where probes will be grouped differently, only true positive probes will be decoded since they are shared positives between two data sets and with less than one false positive probe being shared. Finding positive probes may be performed by using the lower of the two scores for each probe. For true positive probes the lower score is expected to be high. For most negative probes at least one score will be very low, and so it will cancel one false positive score. This process helps reduce the number of cycles or number of required labels and may provide enough power for many applications without the need to use combinatorial labeling.
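The "lower of the two scores" rule can be illustrated with a toy Python sketch. Everything here is an assumption made for illustration: the 12-probe set, the two grouping rules, the idealized 0/1 pool signals, and the function names. Each grouping is also collapsed into a single pooled measurement, which simplifies the two reactions per data set described in the text.

```python
def pool_signals(true_positives, grouping):
    """Idealized signal per pool: 1.0 if the pool contains any truly
    positive probe, else 0.0. grouping maps probe -> pool label."""
    signals = {}
    for probe, pool in grouping.items():
        hit = 1.0 if probe in true_positives else 0.0
        signals[pool] = max(signals.get(pool, 0.0), hit)
    return signals


def decode(groupings, signals_per_grouping, threshold=0.5):
    """Score each probe by the lower of its pool signals across the
    groupings and call it positive if that score exceeds threshold."""
    calls = []
    for probe in groupings[0]:
        score = min(signals[grouping[probe]]
                    for grouping, signals in zip(groupings, signals_per_grouping))
        if score > threshold:
            calls.append(probe)
    return sorted(calls)


probes = range(12)
grouping_a = {p: p // 3 for p in probes}   # pools of consecutive probes
grouping_b = {p: p % 4 for p in probes}    # probes regrouped differently
truth = {1, 7}

signals_a = pool_signals(truth, grouping_a)
signals_b = pool_signals(truth, grouping_b)
candidates_a = decode([grouping_a], [signals_a])          # one grouping only
calls = decode([grouping_a, grouping_b], [signals_a, signals_b])
```

With one grouping alone, every probe sharing a pool with a true positive is called; combining two groupings removes those pool-related false positives in this toy case. Other choices of positives can leave an occasional shared false positive, consistent with the text's statement that fewer than one is expected on average.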

In another embodiment, highly overlapped sets of fragments analyzed in the form of 2-16 subsets on different subarrays with different subsets of probes provide a large amount of mapping information. For example, 250 base fragments starting at every base on average can be analyzed as 2-16 subsets with 2-16 different subsets of probes. DNA fragments that are shifted only 2-26 bases will be analyzed with a few if not all used probe subsets, providing unique chromosomal identification with at least one probe subset.

Typically, twenty specific bases will provide the information necessary for most unique sequences. In one embodiment, this information can be obtained with two anchors in 5 cycles with 256 tags for reading 5×4 bases, or 3 cycles for 24 bases by reading 8 bases per cycle (512 tagging combinations). In another embodiment, 3 cycles×6 bases=18 bases (5×3+3 at a distance of 20-30 bases), and in yet another embodiment, 4 times fewer tags are used for 3-mers, which may need 3 anchors (3×6+3+3 bases).

In one aspect, a high capacity DNA array platform can be used to analyze 100 patient or other DNA samples simultaneously. In the direct hybridization (or combinatorial ligation) approach of mapping, only a subset of probes is used and does not provide tag sequence automatically. For 4-base tags all 256 probes (e.g. NxUxBBBBUxNx) may be used for mapping or as additional probes. If these probes are also used for mapping, multiple sets of 256 shifted probes may be needed to identify the tag sequence.

In one aspect, 5-6 colors are used to decode all 16 dinucleotides and read 2-12 bases in one decoding cycle. In one embodiment, a set of 4 tags is used; in another embodiment, the set is expanded to 6 tags. Multiple decoding cycles alone or in combination with anchors with different melting temperatures can be used to increase the number of bases that can be read in a single decoding cycle.

In one aspect, 4 bases per ligation cycle are read by testing 2 bases on each end of an adaptor and by using two corresponding anchors. Both types of probes B2N6-tail and tail-N6B2 may be used simultaneously. Each probe type may have unique tails and a matching set of 6 unique tags. Two decoding cycles, using two sets of 6 tags, would identify 4 bases. In 11 ligation cycles 42 continuous and 2 redundant bases would be determined. To read a mate-pair of 42+18=60 bases, 15 ligation cycles would be required.
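The cycle counts in this aspect can be checked with a small Python helper. The function name `ligation_cycles` is a hypothetical one chosen for this sketch; it simply rounds up the number of cycles needed and reports how many base reads are redundant.

```python
import math


def ligation_cycles(unique_bases, bases_per_cycle):
    """Return (cycles needed, redundant base reads) when a fixed number
    of bases is read in every ligation cycle."""
    cycles = math.ceil(unique_bases / bases_per_cycle)
    redundant = cycles * bases_per_cycle - unique_bases
    return cycles, redundant


continuous = ligation_cycles(42, 4)       # 42 continuous bases, 4 per cycle
mate_pair = ligation_cycles(42 + 18, 4)   # 60-base mate-pair, 4 per cycle
```

For 42 bases at 4 bases per cycle this gives 11 cycles with 2 redundant reads, and for the 60-base mate-pair it gives 15 cycles with none, matching the figures above.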

In another aspect, 8 bases are read per ligation cycle. A total of 4 anchors may be used (each of two sides of two adaptors). Probes and tags may be the same as in the first option. Thus, in two decoding cycles 2 bases on each side of one adaptor can be determined. Because an additional 2 anchors may be used for the second adaptor, additional information is needed to discriminate which of the two positive 2-mers belongs to which anchor/adaptor end. This can be achieved by designing the two anchors for the second adaptor with higher melting temperatures (Tm). Thus, schematically, the 4 anchors are:

            adaptor 1                        adaptor 2
...GGGGDDDDDDDDDDDDDDDDGGGGGGGGGGGGDDDDDDDDDDDDDDDDDDDDDDDDDDGGGGG...
       AAAAAAAA AAAAAAAA           AAAAAAAAAAAAA AAAAAAAAAAAAA

D = adaptor bases, G = genomic bases, A = anchor bases.

After two standard cycles of decoding and imaging of 5-6 dyes, a stringent wash can be applied that removes low Tm anchors and the tailed probes that are ligated to them, but does not affect high Tm anchors. By repeating two cycles of tag binding and measuring fluorescence, the fluorescence signals specific to the second adaptor with longer (higher Tm) anchors are determined. The difference between the first and second set of measurements gives the signal produced by 2-mers corresponding to the first adaptor. A strip-off wash at even higher temperature would remove higher Tm anchors and free DNA for the next ligation cycle. Higher Tm anchors may be photo, chemically or temperature cleavable for easy strip-off. To read more bases the process can be repeated 3 times to read 24 bases surrounding two adaptors, or 6 times to read 48 bases surrounding 4 adaptors. To read the remaining 12 bases for the fifth adaptor, 3 additional cycles may be required. In these 3 cycles, repeat sequencing of 12 previously sequenced bases with the same or shifted anchor-probe pair may also serve as a control of data quality. In total, 9 ligation cycles and 36 decoding cycles can be used to determine 72 bases (60 unique and 12 repeated).
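The subtraction step that separates the two adaptors' signals can be sketched as follows. This is a minimal Python illustration under stated assumptions: the function names, the per-dinucleotide intensity dictionaries, and the example values are hypothetical, and real data would need normalization for signal lost in the wash.

```python
def split_by_tm(before_wash, after_wash):
    """Separate per-probe signals by anchor melting temperature.

    before_wash: signal with both low-Tm and high-Tm anchors present.
    after_wash:  signal after the stringent wash (high-Tm anchors only).
    Returns (first_adaptor, second_adaptor) signal dictionaries."""
    second_adaptor = dict(after_wash)
    first_adaptor = {probe: before_wash[probe] - after_wash.get(probe, 0.0)
                     for probe in before_wash}
    return first_adaptor, second_adaptor


def call_probe(signals):
    """Return the probe with the strongest signal."""
    return max(signals, key=signals.get)


# Hypothetical intensities for three dinucleotide probes.
before = {"AC": 1.00, "GT": 0.90, "TT": 0.10}
after = {"AC": 0.05, "GT": 0.85, "TT": 0.05}

first, second = split_by_tm(before, after)
first_call, second_call = call_probe(first), call_probe(second)
```

Here the probe whose signal disappears in the stringent wash is attributed to the first (low Tm) adaptor, and the probe that persists is attributed to the second (high Tm) adaptor.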

In another aspect, 12 bases are read per cycle by expanding the process from 2 to 3 levels, providing a read of 12 bases (3×2×2) per ligation cycle. Similarly, 72 bases (60 unique and 12 repeated) can be determined in just 6 ligation cycles. The Tm approach can be used in many other configurations with an increased number of anchors that can be differentially removed one by one. The key advantage of this approach is that in one ligation reaction, probes of one type are ligated to 3 different anchors.

In another aspect, 8 bases are read in one ligation cycle without using Tm differentiation of anchors. To achieve this, the anchor probes are designed to read 2 bases simultaneously with a 2 base read by the non-anchor probes. Two such pairs can be analyzed in one ligation cycle, reading a total of 8 bases per cycle as follows.

DDDDDDDD GGGGGGGGGGGG DDDDDDDDDDDDDDDDDDD GGGGGGGGGGGG DDDDDDDDD
tail-AAAAAA BB . BB NNNNNN-TAIL                      TAIL-NNNNNN BB . BB AAAAAA-tail (cycle 1)
      tail-AAANNNN BB . BB NNNNNN-TAIL           TAIL-NNNNNN BB . BB NNNNAAA-tail (cycle 2)
            tail-NNNNNN BB . BB AAAAAA-TAIL TAIL-AAAAAA BB . BB NNNNNN-tail (cycle 3)

D = adaptor bases, G = genomic bases, A = anchor bases, B = specified probe bases, N = degenerate probe bases

Decoding would be performed in four cycles having 4 sets of tags specific for each of 4 tail groups. Interestingly, this approach may provide 44+20=64 bases using 5 adaptors (8+4×12+8) in 8 ligation cycles without generating any redundant base reads. Reading 16 instead of 12 bases between two adaptors and a total of 80 bases using 5 adaptors is a natural progression for this system. The main new development that may be required is to implement a stabilization process for the probe-anchor ligation product that is compatible with the encoding tail present at the anchor probe.

These processes, coupled with inserting 1-2 additional adaptors 12 bases apart, can increase parallel reading per ligation cycle from 2 to 8 or even 12 bases in just 6-15 ligation cycles. In a further embodiment, 16 bases are read between neighboring adaptors, allowing the use of only the initial plus 2 inserted adaptors, leading to the ability to determine 40 (2×16+8) bases of continuous sequence.

Multiplex Probe-Anchor Ligation Assay

In one aspect, probe sets comprising 16 probes of the structure BBNNNNNN-tail, in which the tail is approximately 15 to 20 bases in length, and a complementary tag sequence to the tail labeled with fluorophores are prepared. Tails and tags are designed to minimize interference with the analyzed DNA. In one embodiment, tail and tag sequences are prepared from iso-C and iso-G nucleotides to prevent the tag sequence from interacting with the template DNA.

It is possible to test the efficiency of different BBNNNNNN-tail probes with different tail and tag sequences. Sixteen tail sequences may be required, but only eight of the 16 probes (with 16 different tails) may be analyzed in each decoding cycle since the maximum capacity of the 4-color mixing is 10 possible combinations of two (not including a null signal as a possible probe indicator). Each tail sequence may have the capacity to bind two tags, and each tag in this design may only have one fluorophore attached. An initial design of a set of 4 tags, one for each color, may be performed. The complementary sequences of these tags may be combined to create 8 tails (out of a total of 10 possible combinations). The remaining 8 of the 16 tails may also require an additional set of 4 tags, but they can carry the same fluorophores as used for the first set of 4 tags.
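As a non-limiting sketch, the count of 10 two-tag combinations from 4 colors, and one possible assignment of the 16 BB dinucleotides to two decoding groups of 8 tails each, can be enumerated in Python. The color names, the choice of which 8 of the 10 combinations are used, and the ordering of dinucleotides are all assumptions made here for illustration only.

```python
from itertools import combinations_with_replacement, product

COLORS = ("color1", "color2", "color3", "color4")  # placeholder names


def build_tail_table(colors=COLORS, tails_per_group=8):
    """Map each of the 16 dinucleotides to (decoding group, pair of tag colors)."""
    # A tail binds two tags; the same color may be used twice, so combinations
    # are taken with replacement: C(4,2) + 4 = 10 for four colors.
    pairs = list(combinations_with_replacement(colors, 2))
    if tails_per_group > len(pairs):
        raise ValueError("not enough color pairs for one decoding group")
    dinucleotides = ["".join(p) for p in product("ACGT", repeat=2)]
    table = {}
    for i, dinuc in enumerate(dinucleotides):
        group, slot = divmod(i, tails_per_group)   # group 0 or 1, slot 0..7
        table[dinuc] = (group, pairs[slot])        # arbitrary, illustrative choice
    return pairs, table


if __name__ == "__main__":
    pairs, table = build_tail_table()
    print(len(pairs), "possible two-tag combinations")
    for dinuc, (group, pair) in table.items():
        print(dinuc, "group", group, pair)
```

In this arrangement each decoding cycle scores 8 probes, so two groups of 4 tags carrying the same four fluorophores suffice to score all 16 probes.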

In one aspect, probes may be prepared with a single fluorophore (e.g., TAMRA) to determine the relative strengths of the different tag combinations (i.e. hybrid strengths). Once this information is obtained, it is possible to match the fluorophores to the tags to normalize intensities. A single fluorophore set of tags can also be used to determine the relative efficiencies of the BBNNNNNN region of the probe with a common tail structure. Once these parameters have been determined, a set of 16 BBNNNNNN-tail probes can be prepared. This probe set may be used to hybridize to RCR products derived from the PCR and synthetic target circles or even complex genomic samples.

In one embodiment, arrayed RCR targets are first hybridized with an adaptor probe to determine the DNB locations and relative intensities. This probe is removed using standard techniques, such as by raising the temperature, and a second set of probes can then be hybridized to the array. The second probe set contains an anchor probe and 16 BBNNNNNN-tail probes in a ligation mix. The reaction proceeds for a sufficient length of time, preferably for about 30 minutes, and the unligated, unhybridized probes are then washed away. The next addition to the chamber can include the 4 tag probes that hybridize to the tails of ligated and hybridized BBNNNNNN probes. This hybridization can in some embodiments be as short as 5 minutes to achieve high signal intensities. The chamber is again washed and imaging occurs at the desired wavelengths. The chamber then undergoes heating to remove the tags but maintain the anchor-BBNNNNNN-tail probes in the hybrid. The second group of 4 tags can then be hybridized to score the presence of the second group of 8 BBNNNNNN probes. The level of discrimination between the matching BBNNNNNN probe and the other 15 mismatch BBNNNNNN probes can be determined through the level and combinations of signal intensity.

In one embodiment, to establish a probe-anchor ligation assay, a probe is provided, for example a probe of structure AANNNNNN, to generate enough of a signal for an AATATANN DNA spot with a low ΔG for the TATA sequence. If the signal for the optimal condition is low for some DNA sequences, matching probes can be prepared independently and added into the mix to selectively boost concentrations only for these probes. If 20 sequences out of 256 at the first 4 degenerated positions have to be adjusted, 16×20 additional probes can be prepared.

In one embodiment, 16 probes are developed and tested for reading 2-base sequences from the other side of the genomic segment between two adaptors. Tail and degenerated bases for these probes may be at the 5′ end, e.g. Tail-NNNNNNBB.

In one aspect of the invention, the number of dyes that can be differentiated is maximized by using multiple specific excitation patterns and a maximal number of filters for each excitation pattern. For example, 2-4 excitations, each with 4 different wavelengths (total of 16 wavelengths), can be used in combination with 8-16 filters for each excitation. Algorithms and software are used to analyze intensity patterns and deduce the amount of signal from each of the 8-24 dyes.

In one embodiment, direct labeling with dyes is combined with indirect labeling using haptens (such as biotin) to specifically stain multiple probes. Directly attached dyes may be photo-bleached or differences in the intensity may be calculated before and after staining.

In one embodiment, the number of color labels available for use is expanded by light or chemical de-blocking of quenchers or chemical modifications that shift absorption of the given dye. Color intensities are measured before and after de-blocking treatment. After the first imaging is done, the dye may be photo-bleached before an increase of signal for the given wavelength is measured. With multiple types of quenchers or modifiers (3, 4 or 6) and 8 colors, a total of 24-48 non-combinatorial labels can be generated. Combinatorial labeling with 2 out of 24-48 labels gives a potential of 276-1128 two-label combinations.
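The label counts above follow from simple counting, which may be sketched as below. The sketch assumes that the number of single labels is the product of the number of colors and the number of quencher or modifier types, and that a two-label combination is an unordered pair of distinct labels; the function name is hypothetical.

```python
from math import comb


def label_capacity(n_colors, n_modifier_types):
    """Return (single labels, unordered two-label combinations)."""
    singles = n_colors * n_modifier_types   # e.g. 8 colors x 3 modifier types
    return singles, comb(singles, 2)        # pairs of distinct labels


if __name__ == "__main__":
    for modifiers in (3, 6):
        singles, pairs = label_capacity(8, modifiers)
        print(modifiers, "modifier types:", singles, "labels,", pairs, "pairs")
```

With 8 colors, 3 and 6 modifier types reproduce the two ends of the stated ranges, namely 24 labels with 276 pairs and 48 labels with 1128 pairs.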

Long stable anchors can improve probe hybridization and ligation to different targets. In one embodiment, the number of degenerate bases is increased to minimize the influence of target sequences that form unstable hybrids such as 5′TATA3′. This may increase the stability of the probe/target hybrid, but a probe that does not have a full match at the first 2-4 positions close to the ligation site may hybridize to the target and prevent ligation. To minimize this negative influence, one embodiment provides a higher starting temperature and/or temperature cycling to increase the number of ligatable probes hybridized next to the anchor.

Sequencing Using Primer Extension

End sequencing may be performed from one anchor/primer end by many consecutive cycles of single base extension using specifically labeled nucleotides. In one embodiment, the process includes a step in which the dye or blocker is removed to repeat the extension. Multiple adaptors provide increased flexibility in this process. In one embodiment, 2-6 or more bases are read by single base primer extension using shifted primers in consecutive reactions. Multiple simultaneous shifted 0+1 or 1+1 primer frames on one adaptor or single frame on multiple adaptors or both may be used.

In one embodiment, using the initial plus 3 additional anchors provides 4 primers. By reading 4 bases of each primer, 16 bases are determined in 16 cycles using 4 standard colors, which can be accomplished without combinatorial labeling or tagging. In this embodiment, the primer extension does not have degenerate bases on the labeled component, thus reducing the concentration of dyes used. Because 16 bases may not be sufficient for mapping, 4 primers × 5-6 bases of extension in 20-24 cycles can be used.

Multiplex primer extension is possible by discriminative removal of the primers. Several different methods may be used for such removal based on factors including: primer length, GC content, base or backbone modifications such as LNA or PNA, uracil incorporation, or light sensitive linkage between selected bases. Two to eight stability levels in one group may be designed. Also, 2 to 4 distinct groups that may have different stabilizers or protectors can be used. By applying these labeling methods, 20-24 bases may be determined in as few as 3-5 enzymatic cycles. In another embodiment, a primer protection assay for multiplex primer extension one base at a time is used. In such an embodiment, the primer, for example UUUUUUUNNN, used for the fourth extension provides enough signal because mismatches at NNN can occupy over 50% or over 90% of the target and would not be efficiently extended. Primers with higher specificity may be created by ligating UUUUUUU.UUUNNN or UUUUUUU.UNNNNN.

In one aspect, in order to be able to sequence on each side of the anchor, the attached ssDNA may be converted into dsDNA using the attached primer and removal of the original strand or primer invasion techniques. One approach to remove the original strand is to incorporate into the inserted adaptor a binding site for a restriction enzyme that cuts only one strand. The fragmented strand would then be denatured and washed away.

For performing consecutive or overlapped frames or reading 2-3 bases, a different anchor and/or probe design may be used. For example:

Cycle 1: UUUUUUUUUUU.BBNNNNNN
Cycle 2: UUUUUUUUUNN.BBNNNNNN or UUUUUUUUUUU.NNBBNNNN
Cycle 3: UUUUUUUUUNN.NNBBNNNN

Where U represents common pre-defined bases, B a specified base and N a degenerate base.

Anchors that have degenerated bases may be designed in two parts to assure preferential binding of anchors that have matching bases at degenerated positions. Overlapped or shifted frames may be used to read each base multiple times in the same target. Two examples for multiple reading of the first four bases after the anchor are presented below:

UUUUUUUUUU.UBBNNNNN    UUUUUUUUUUU.BBNNNNNN
UUUUUUUUUUN.BBNNNNNN   UUUUUUUUUUU.NNBBNNNN
UUUUUUUUUNN.BBNNNNNN   UUUUUUUUUUN.BBNNNNNN

Where U represents common pre-defined bases, B a specified base and N a degenerate base. The ligation site is indicated with a period (.)
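To make the frame notation concrete, the following Python sketch reports which target positions are interrogated by the specified (B) bases of a given anchor.probe pattern. It assumes, as one reading of the notation rather than something stated in the text, that positions are numbered from the first base following the last pre-defined (U) base, so that degenerate (N) bases in the anchor already cover target positions. The function names are hypothetical.

```python
from collections import Counter


def interrogated_positions(pattern):
    """Return 1-based target positions covered by B bases in 'anchor.probe'."""
    anchor, probe = pattern.split(".")     # the period marks the ligation site
    if "U" not in anchor:
        raise ValueError("anchor must contain pre-defined (U) bases")
    seq = anchor + probe
    last_u = seq.rindex("U")               # last common pre-defined base
    return [i - last_u for i, base in enumerate(seq) if base == "B"]


def coverage(patterns):
    """Count how many times each target position is read across patterns."""
    counts = Counter()
    for pattern in patterns:
        counts.update(interrogated_positions(pattern))
    return dict(sorted(counts.items()))


if __name__ == "__main__":
    frames = [
        "UUUUUUUUUUU.BBNNNNNN",
        "UUUUUUUUUNN.BBNNNNNN",
        "UUUUUUUUUUU.NNBBNNNN",
        "UUUUUUUUUNN.NNBBNNNN",
    ]
    for frame in frames:
        print(frame, interrogated_positions(frame))
    print(coverage(frames))
```

Under this numbering, shifting the degenerate bases from the probe into the anchor moves the 2-base read window further from the adaptor, and overlapped frames read the same position more than once.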

Detection Instrumentation

In one aspect of the invention, hardware is provided to allow detection of the ligation and hybridization events of the sequencing methods. In one embodiment, the system hardware comprises three major components: the illumination system, the reaction chamber, and the detector system. The detection instrument can include several features such as: adjustable laser power, electronic shutter, auto focus, and operating software.

Signals from single molecules on random arrays made in accordance with the invention can be generated and detected by a number of detection systems, including, but not limited to, scanning electron microscopy, near field scanning optical microscopy (NSOM), total internal reflection fluorescence microscopy (TIRFM), and the like. Abundant guidance is found in the literature for applying such techniques for analyzing and detecting nanoscale structures on surfaces, as evidenced by the following references that are incorporated by reference: Reimer et al, editors, Scanning Electron Microscopy: Physics of Image Formation and Microanalysis, 2nd Edition (Springer, 1998); Nie et al, Anal. Chem., 78: 1528-1534 (2006); Hecht et al, Journal Chemical Physics, 112: 7761-7774 (2000); Zhu et al, editors, Near-Field Optics: Principles and Applications (World Scientific Publishing, Singapore, 1999); Drmanac, International patent publication WO 2004/076683; Lehr et al, Anal. Chem., 75: 2414-2420 (2003); Neuschafer et al, Biosensors & Bioelectronics, 18: 489-497 (2003); Neuschafer et al, U.S. Pat. No. 6,289,144; and the like. Of particular interest is TIRFM, for example, as disclosed by Neuschafer et al, U.S. Pat. No. 6,289,144; Lehr et al (cited above); and Drmanac, International patent publication WO 2004/076683.

In one aspect, instruments for use with arrays of the invention comprise three basic components: (i) a fluidics system for storing and transferring detection and processing reagents, e.g. probes, wash solutions, and the like, to an array; (ii) a reaction chamber, or flow cell, holding or comprising an array and having flow-through and temperature control capability; and (iii) an illumination and detection system. In one embodiment, a flow cell has a temperature control subsystem with ability to maintain temperature in the range from about 5-95° C., or more specifically 10-85° C., and can change temperature with a rate of about 0.5-2° C. per second.

In one aspect, a flow cell can be used for 1″ square, 170 micrometer thick cover slips that have been derivatized to bind macromolecular structures of the invention. The cell encloses the “array” by sandwiching the glass and a gasket between two planes. One plane has an opening of sufficient size to permit imaging, and an indexing pocket for the cover slip. The other plane has an indexing pocket for the gasket, fluid ports, and a temperature control system. One fluid port is connected to a syringe pump which “pulls” or “pushes” fluid from the flow cell; the other port is connected to a funnel-like mixing chamber. The chamber, in turn, is equipped with a liquid level sensor. The solutions are dispensed into the funnel, mixed if needed, then drawn into the flow cell. When the level sensor reads air in the funnel's connection to the flow cell, the pump is reversed a known amount to back the fluid up to the funnel. This prevents air from entering the flow cell. The cover slip surface may be sectioned off and divided into strips to accommodate fluid flow/capillary effects caused by sandwiching. Such substrate may be housed in an “open air”/“open face” chamber to promote even flow of the buffers over the substrate by eliminating capillary flow effects. Imaging may be accomplished with a 100× objective using TIRF or epi illumination and a 1.3 megapixel Hamamatsu ORCA-ER-AG on a Zeiss Axiovert 200, or like system. This configuration images RCR concatemers bound randomly to a substrate (non-ordered array). Imaging speed may be improved by decreasing the objective magnification power, using grid patterned arrays and increasing the number of pixels of data collected in each image.

In one embodiment, four or more cameras may be used, preferably in the 10-16 megapixel range. Multiple band pass filters and dichroic mirrors may also be used to collect pixel data across up to four or more emission spectra. To compensate for the lower light collecting power of the decreased magnification objective, the power of the excitation light source can be increased. Throughput can be increased by using one or more flow chambers with each camera, so that the imaging system is not idle while the samples are being hybridized/reacted. Because the probing of arrays can be non-sequential, more than one imaging system can be used to collect data from a set of arrays, further decreasing assay time.

During the imaging process, it is preferable that the substrate remain in focus. Some key factors in maintaining focus are the flatness of the substrate, orthogonality of the substrate to the focus plane, and mechanical forces on the substrate that may deform it. Substrate flatness can be well-controlled, and glass plates which have better than ¼ wave flatness are readily obtained. Uneven mechanical forces on the substrate can be minimized through proper design of the hybridization chamber. Orthogonality to the focus plane can be achieved by a well adjusted, high precision stage. Auto focus routines generally take additional time to run, so it is desirable to run them only if necessary. In a preferred embodiment, each image is acquired and then analyzed using a fast algorithm to determine if the image is in focus. If the image is out of focus, the auto focus routine will be triggered. The system will then store the objective's Z position information to be used upon return to that section of that array during the next imaging cycle. By mapping the objective's Z position at various locations on the substrate, it is possible to reduce the time required for substrate image acquisition.
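The on-demand auto focus logic described above may be sketched, purely for illustration, as follows. The callables acquire, in_focus and autofocus stand in for instrument functions and are hypothetical; the sketch also assumes that the image is re-acquired after the auto focus routine runs, which the text does not state explicitly.

```python
def image_tile(tile, z_map, acquire, in_focus, autofocus, default_z=0.0):
    """Acquire one tile, running the slow auto focus routine only when needed.

    tile      -- hashable identifier of a section of the array
    z_map     -- dict of stored objective Z positions, updated in place
    acquire   -- callable(tile, z) -> image
    in_focus  -- callable(image) -> bool, the fast focus check
    autofocus -- callable(tile, z) -> corrected z, the slow routine
    """
    z = z_map.get(tile, default_z)      # reuse Z stored on the previous cycle
    image = acquire(tile, z)
    if not in_focus(image):             # fast per-image check
        z = autofocus(tile, z)          # triggered only for out-of-focus images
        image = acquire(tile, z)        # assumption: re-image at the new Z
    z_map[tile] = z                     # remember Z for the next imaging cycle
    return image, z


def image_substrate(tiles, z_map, acquire, in_focus, autofocus):
    """Image every tile once and return the images keyed by tile."""
    return {t: image_tile(t, z_map, acquire, in_focus, autofocus)[0] for t in tiles}
```

Because the Z map persists between imaging cycles, tiles that were refocused once are normally acquired in focus on later cycles without invoking the slow routine again.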

In one aspect, a suitable illumination and detection system for fluorescence-based signal is a Zeiss Axiovert 200 equipped with a TIRF slider coupled to an 80 milliwatt 532 nm solid state laser. The slider illuminates the substrate through the objective at the correct TIRF illumination angle. TIRF can also be accomplished without the use of the objective by illuminating the substrate through a prism optically coupled to the substrate. Planar wave guides can also be used to implement TIRF on the substrate. Epi illumination can also be employed. The light source can be rastered, spread beam, coherent, incoherent, and originate from a single or multi-spectrum source.

One embodiment for the imaging system includes a 20× lens with a 1.25 mm field of view. A 10 megapixel camera is used for detection. Such a system is able to image approximately 1.5 million concatemers attached to the patterned array at 1 micron pitch. Under such a configuration, there are approximately 6.4 pixels per concatemer. The number of pixels per concatemer can be adjusted by increasing or decreasing the field of view of the objective. For example, a 1 mm field of view yields a value of 10 pixels per concatemer and a 2 mm field of view yields a value of 2.5 pixels per concatemer. The field of view may be adjusted relative to the magnification and numerical aperture of the objective to yield the lowest pixel count per concatemer that is still capable of being resolved by the optics and image analysis software.
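The pixel budget quoted above can be reproduced with a short calculation. This sketch assumes a square field of view that is fully tiled with capture sites on a square grid and imaged onto the whole sensor; the function names are hypothetical.

```python
def sites_in_field(fov_mm, pitch_um=1.0):
    """Number of array sites in a square field of view of side fov_mm."""
    sites_per_side = fov_mm * 1000.0 / pitch_um   # mm -> micrometers
    return sites_per_side ** 2


def pixels_per_concatemer(fov_mm, camera_pixels=10_000_000, pitch_um=1.0):
    """Camera pixels available per concatemer site."""
    return camera_pixels / sites_in_field(fov_mm, pitch_um)


if __name__ == "__main__":
    for fov in (1.0, 1.25, 2.0):
        print(fov, "mm:", sites_in_field(fov), "sites,",
              pixels_per_concatemer(fov), "pixels per site")
```

With a 10 megapixel camera and 1 micron pitch, the 1.25 mm field contains about 1.56 million sites at 6.4 pixels each, and the 1 mm and 2 mm fields give 10 and 2.5 pixels per site, matching the values stated above.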

Both TIRF and EPI illumination allow for almost any light source to be used. One illumination schema provides a common set of monochromatic illumination sources (about 4 lasers for 6-8 colors) which is shared amongst imagers. Each imager collects data at a different wavelength at any given time and the light sources would be switched to the imagers via an optical switching system. In such an embodiment, the illumination source preferably produces at least 6, but more preferably 8 different wavelengths. Such sources include gas lasers, multiple diode pumped solid state lasers combined through a fiber coupler, filtered Xenon Arc lamps, tunable lasers, or the more novel Spectralum Light Engine, soon to be offered by Tidal Photonics. The Spectralum Light Engine uses a prism to spectrally separate light. The spectrum is projected onto a Texas Instruments Digital Light Processor, which can selectively reflect any portion of the spectrum into a fiber or optical connector. This system is capable of monitoring and calibrating the power output across individual wavelengths to keep them constant so as to automatically compensate for intensity differences as bulbs age or between bulb changes. The following table represents examples of possible lasers, dyes and filters:

laser     excitation filter   emission filter   Dye
407 nm    405/12              436/12            Alexa-405 401/421
407 nm    405/12              546/10            cascade yellow 409/558
488 nm    488/10              514/11            Alexa-488 492/517
543 nm    546/10              540/565           Tamra 540/565
543 nm    546/10              620/12            Bodipy 577/618 577/618
          546/10              620/12            Alexa-594 594/613
635 nm    635/11              650/11            Alexa-635 632/647
635 nm    635/11                                Alexa700 702/723

In one aspect, imaging is accomplished through a 100× objective. The excitation light source is an 80 milliwatt diode pumped solid state laser. This light source has been used successfully with TIRFM and EPI illumination techniques. The images are acquired using a 1.3 megapixel Hamamatsu ORCA-ER-AG camera and a Zeiss Axiovert 200 inverted microscope. This configuration currently images DNBs bound randomly to a substrate at a 0.5 second exposure time.

For handling multiple hybridization cycles, a robotic station that is fully integrated with both the reaction chamber and detection system can be implemented for use with the present invention. Epifluorescence can be used for detecting greater than 10-20 fluorescent molecules per target site. An advantage of using epifluorescence is that it allows the use of probes of multiple colors with standard microscopes.

In one aspect, a two piece flow cell is used to house a 1″ square, 170 μm thick cover slip, which has been derivatized and activated to bind DNBs. A side port is connected to a syringe pump that “pulls” or “pushes” fluid from the flow cell. A second port is connected to a funnel-like mixing chamber that is equipped with a liquid level sensor. The solutions are dispensed into the mixing chamber, mixed if needed, then drawn into the flow cell. When the level sensor detects air in the funnel's connection to the flow cell, the pump is reversed a known amount to back the fluid up to the funnel. This prevents air from entering the flow cell. This chamber has worked well for cover slip sized substrates and may be used in modified form for the larger substrates. A three-axis robotic gantry pipetting system integrated with the hybridization chamber and imaging subsystem can be configured for fully automated probe pipetting.

Fiducials

In one embodiment, the regular pattern of capture cells is interrupted in such a way as to encode location information into each acquired image. Approximately 1000 cells per image can be removed from the pattern to create a 10 bit code, which would represent up to 1024 named locations on each substrate (FIG. 5).

The physical features of the coding region can be used as a reference to locate all pixels in the image during image analysis, while the code itself is used to verify that the instrument imaged the correct area of the substrate. A key feature of the coding region is that each element is represented by a block of no-binding spots (an “empty area”). This eliminates the need for fluorescent markers on the substrate. RCR products which are positive for a given probe set define each element's borders. This means that the region would still be recognizable even if only 5% to 10% of RCR products bound to the surface are positive for a given probe pool. In one embodiment, the code is readable if each coding element represents 50 capture cells.
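A minimal sketch of how such a location code might be written and read is given below. The mapping of bits to blocks is an assumption made here for illustration: a coding element whose capture cells were removed is read as 1 and an intact element as 0. The estimate of how often an intact element would appear empty additionally assumes that spots are positive independently of one another. None of these names or conventions is taken from FIG. 5.

```python
N_BITS = 10  # a 10 bit code distinguishes 2**10 = 1024 locations


def encode_location(index, n_bits=N_BITS):
    """Return the bit pattern (most significant bit first) for a location."""
    if not 0 <= index < 2 ** n_bits:
        raise ValueError("location index out of range")
    return [(index >> shift) & 1 for shift in reversed(range(n_bits))]


def decode_location(positive_counts, min_positive=1):
    """Read a location from the number of positive spots seen in each element.

    An element with fewer than min_positive positive spots is treated as an
    emptied block (bit 1); otherwise it is treated as intact (bit 0).
    """
    value = 0
    for count in positive_counts:
        bit = 0 if count >= min_positive else 1
        value = (value << 1) | bit
    return value


def false_empty_probability(fraction_positive, cells_per_element):
    """Chance that an intact element shows no positive spot at all."""
    return (1.0 - fraction_positive) ** cells_per_element


if __name__ == "__main__":
    bits = encode_location(613)
    counts = [0 if b else 4 for b in bits]   # intact blocks show a few positives
    print(bits, decode_location(counts))
    for p in (0.05, 0.10):
        print(p, false_empty_probability(p, 50))
```

The last function indicates why larger coding elements are easier to read when only a small fraction of spots is positive for a given probe pool: the chance of mistaking an intact element for an empty one falls geometrically with the number of cells per element.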

Kits of the Invention

In the commercialization of the methods described herein, certain kits for construction of random arrays of the invention and for using the same for various applications are particularly useful. Kits for applications of random arrays of the invention include, but are not limited to, kits for determining the nucleotide sequence of target polynucleotides. A kit typically comprises at least one support having a surface and one or more reagents necessary or useful for constructing a random array of the invention or for carrying out an application therewith. Such reagents include, without limitation, nucleic acid primers, probes, adaptors, enzymes, and the like, and are each packaged in a container, such as, without limitation, a vial, tube or bottle, in a package suitable for commercial distribution, such as, without limitation, a box, a sealed pouch, a blister pack and a carton.

The package typically contains a label or packaging insert indicating the uses of the packaged materials. As used herein, “packaging materials” includes any article used in the packaging for distribution of reagents in a kit, including without limitation containers, vials, tubes, bottles, pouches, blister packaging, labels, tags, instruction sheets and package inserts.

In another aspect, the invention provides kits for sequencing a target polynucleotide comprising the following components: (i) a support having a planar surface having an array of optically resolvable discrete spaced apart regions, wherein each discrete spaced apart region has an area of less than 1 μm²; (ii) a first set of probes for hybridizing to a plurality of concatemers randomly disposed on the discrete spaced apart regions, the concatemers each containing multiple copies of a DNA fragment of the target polynucleotide; and (iii) a second set of probes for hybridizing to the plurality of concatemers such that whenever a probe from the first set hybridizes contiguously to a probe from the second set, the probes are ligated. Such kits may further include a ligase, a ligase buffer, and a hybridization buffer. In some embodiments, the discrete spaced apart regions may have capture oligonucleotides attached and the concatemers may each have a region complementary to the capture oligonucleotides such that said concatemers are capable of being attached to the discrete spaced apart regions by formation of complexes between the capture oligonucleotides and the complementary regions of said concatemers.

In another aspect, the invention includes kits for circularizing DNA fragments. In an exemplary embodiment, such a kit includes the components: (a) at least one adaptor oligonucleotide for ligating to one or more DNA fragments and forming DNA circles therewith, (b) a terminal transferase for attaching a homopolymer tail to said DNA fragments to provide a binding site for a first end of said adaptor oligonucleotide, (c) a ligase for ligating a strand of said adaptor oligonucleotide to ends of said DNA fragment to form said DNA circle, (d) a primer for annealing to a region of the strand of said adaptor oligonucleotide, and (e) a DNA polymerase for extending the primer annealed to the strand in a rolling circle replication reaction. In a further embodiment, the above adaptor oligonucleotide may have a second end having a number of degenerate bases in the range of from 4 to 12. The above kit may further include reaction buffers for the terminal transferase, ligase, and DNA polymerase.

In still another aspect, the invention includes a kit for circularizing DNA fragments using a CircLigase™ enzyme (Epicentre Biotechnologies, Madison, Wis.), which kit comprises a volume exclusion polymer. In a further embodiment, the kit includes the following components: (a) reaction buffer for controlling pH and providing an optimized salt composition for CircLigase, and (b) CircLigase cofactors. In another aspect, a reaction buffer for such kit comprises 0.5 M MOPS (pH 7.5), 0.1 M KCl, 50 mM MgCl₂, and 10 mM DTT. In another aspect, such kit includes CircLigase, e.g. 10-100 μL CircLigase solution (at 100 unit/μL). Exemplary volume exclusion polymers are disclosed in U.S. Pat. No. 4,886,741, which is incorporated by reference, and include polyethylene glycol, polyvinylpyrrolidone, dextran sulfate, and like polymers. In one aspect, polyethylene glycol (PEG) is 50% PEG4000. In one aspect, a kit for circle formation includes the following:

Amount    Component                                  Final Conc.
2 μL      CircLigase™ 10× reaction buffer            1×
0.5 μL    1 mM ATP                                   25 μM
0.5 μL    50 mM MnCl₂                                1.25 mM
4 μL      50% PEG4000                                10%
2 μL      CircLigase™ ssDNA ligase (100 units/μL)    10 units/μL
          single stranded DNA template               0.5-10 pmol/μL
          sterile water
Final reaction volume: 20 μL.
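The final concentrations listed above follow from the stock concentration, the volume added and the 20 μL reaction volume. The following sketch recomputes them; the data structure and function names are hypothetical, and units are carried as plain labels rather than converted.

```python
FINAL_VOLUME_UL = 20.0

# (component, stock concentration, unit label, volume added in microliters)
COMPONENTS = [
    ("reaction buffer", 10.0, "x", 2.0),
    ("ATP", 1.0, "mM", 0.5),
    ("MnCl2", 50.0, "mM", 0.5),
    ("PEG4000", 50.0, "%", 4.0),
    ("ssDNA ligase", 100.0, "units/uL", 2.0),
]


def final_concentration(stock, volume_ul, total_ul=FINAL_VOLUME_UL):
    """Dilution rule C_final = C_stock * V_added / V_total."""
    return stock * volume_ul / total_ul


def remaining_volume(components=COMPONENTS, total_ul=FINAL_VOLUME_UL):
    """Volume left for DNA template plus sterile water."""
    return total_ul - sum(volume for _, _, _, volume in components)


if __name__ == "__main__":
    for name, stock, unit, volume in COMPONENTS:
        print(name, final_concentration(stock, volume), unit)
    print("template + water:", remaining_volume(), "uL")
```

Note that the ATP result is expressed in mM, so 0.025 mM corresponds to the 25 μM entry in the table, and the listed reagents leave 11 μL for the DNA template and sterile water.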

The above components can be used in a number of different protocols known in the art, for example: (1) Heat DNA at 60-96° C. depending on the length of the DNA (ssDNA templates that have a 5′-phosphate and a 3′-hydroxyl group); (2) Preheat 2.2× reaction mix at 60° C. for about 5-10 min; (3) If DNA was preheated to 96° C., cool it down to 60° C. Mix DNA and buffer at 60° C. without cooling it down and incubate for 2-3 h; (4) Heat-inactivate enzyme to stop the ligation reaction.

The present invention may be better understood by reference to the following non-limiting Examples, which are provided as exemplary of the invention. The following examples are presented in order to more fully illustrate preferred embodiments of the invention, but should in no way be construed as limiting the broad scope of the invention.

EXAMPLES

Example 1 RCR Based Formation and Attachment of DNBs

Two synthetic targets were co-amplified. About one million molecules were captured on the glass surface, and then probed for one of the targets. After imaging and photo-bleaching the first probe, the second target was probed. Successive hybridization with amplicon specific probes showed that each spot on the array corresponded uniquely to either one of the two amplicon sequences. It was also confirmed that the probe could be removed through heating to 70° C. and then re-hybridized to produce equally strong signals.

Example 2 Validation of Circle Formation and Amplification

The circle formation and amplification process was validated using E. coli DNA (FIG. 6). A universal adaptor, which also served as the binding site for capture probes and RCR primer, was ligated to the 5′ end of the target molecule using a universal template DNA containing degenerate bases for binding to all genomic sequences. The 3′ end of the target molecule was modified by addition of a poly-dA tail using terminal transferase. The modified target was then circularized using a bridging template complementary to the adaptor and to the oligo-dA tail.

Example 3 Validation of Ligation with Condensed Concatemers

The ability for probe ligation to occur with the condensed concatemers was tested. Reactions were carried out at 20° C. for 10 min using ligase, followed by a brief wash of the chamber to remove excess probes. The ligation of a 6-mer and a labeled 5-mer produced signal levels comparable to that of an 11-mer. Software modules, including image analysis of random arrays, were tested on simulated data for whole genome sequence reconstruction.

Example 4 Identification of Targets from Multiple Pathogens Using a Single Array

PCR products from diagnostic regions of Bacillus anthracis and Yersinia pestis were converted into single stranded DNA and attached to a universal adaptor. These 2 samples were then mixed and replicated together using RCR and deposited onto the chip surface as a random array. Successive hybridization with amplicon specific probes showed that each spot on the array corresponded uniquely to either one of the two amplicon sequences and that they can be identified specifically with the probes (FIG. 7), thus demonstrating sensitivity and specificity for identifying DNA present in submicron size DNA nano-balls having about 100-1000 copies of a DNA fragment generated by the RCR reaction.

A 155 bp amplicon sequence from B. anthracis and a 275 bp amplicon sequence from Y. pestis were amplified using standard PCR techniques with PCR primers in which one primer of the pair was phosphorylated. A single stranded form of the PCR products was generated by degradation of the phosphorylated strand using lambda exonuclease. The 5′ end of the remaining strand was then phosphorylated using T4 DNA polynucleotide kinase to allow ligation of the single stranded product to the universal adaptor. The universal adaptor was ligated using T4 DNA ligase to the 5′ end of the target molecule, assisted by a template oligonucleotide complementary to the 5′ end of the targets and 3′ end of the universal adaptor. The adaptor ligated targets were then circularized using bridging oligonucleotides with bases complementary to the adaptor and to the 3′ end of the targets. Linear DNA molecules were removed by treating with exonuclease I. RCR products were generated by mixing the single-stranded samples and using Phi29 polymerase to replicate around the circularized adaptor-target molecules with the bridging oligonucleotides as the initiating primers. The RCR products were captured on the glass slide via the capture oligonucleotide, which was attached to derivatized glass coverslips and was complementary to the universal adaptor sequence.

Arrayed target nano-ball molecules derived from B. anthracis and Y. pestis PCR amplicons were probed sequentially with TAMRA-labeled 11-mer probes complementary to the universal adaptor sequence, or 11-mer probes complementary to one of the two amplicon sequences. By overlaying the images obtained from successive hybridization of 3 probes (FIG. 7), it can be seen that most of the arrayed molecules that hybridized with the adaptor probe (blue spots) would only hybridize to either the amplicon 1 probe (red spots) or the amplicon 2 probe (green spots), with very few that would hybridize to both. This specific hybridization pattern demonstrated that each spot on the array contained only one type of sequence, either the B. anthracis amplicon or the Y. pestis amplicon. It also demonstrated that the rSBH process was able to distinguish target molecules of different sequences deposited onto the array by using sequence specific probes.
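The overlay logic described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function name classify_spot, the category labels, and the five example spots are hypothetical and are not data from the experiment; each spot is assumed to have already been reduced to three yes/no hybridization calls (adaptor probe, amplicon 1 probe, amplicon 2 probe).

```python
from collections import Counter


def classify_spot(adaptor, amplicon1, amplicon2):
    """Assign one spot to a category from three yes/no hybridization calls."""
    if not adaptor:
        return "no adaptor signal"   # no DNB detected at this location
    if amplicon1 and amplicon2:
        return "both amplicons"      # the rare mixed case noted in the text
    if amplicon1:
        return "amplicon 1"
    if amplicon2:
        return "amplicon 2"
    return "adaptor only"            # DNB present but neither amplicon probe bound


# Hypothetical calls for five spots: (adaptor, amplicon 1, amplicon 2).
# These values are invented for illustration only.
example_spots = [
    (True, True, False),
    (True, False, True),
    (True, True, False),
    (True, True, True),
    (True, False, False),
]

summary = Counter(classify_spot(*spot) for spot in example_spots)
```

In such a tally, a large majority of spots falling into "amplicon 1" or "amplicon 2" and few into "both amplicons" would correspond to the pattern reported for FIG. 7.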

Example 5 Decoding Base Position in Arrayed DNBs Created from 80-mer Oligonucleotide with Degenerate Bases

Individual molecules of a synthetic oligonucleotide containing a degenerate base were divided into 4 sub-populations, each having either an A, C, G or T base at that particular position. An array of DNBs created from this synthetic DNA can have about 25% of spots with each of the bases. Four successive hybridizations and ligations of pairs of probes specific to each of the 4 bases identified the sub-populations (FIG. 8).

A 5′ phosphorylated, 3′ TAMRA-labeled pentamer oligonucleotide was paired with one of the four hexamer oligonucleotides. Each of these 4 ligation probe pairs hybridizes to either an A, C, G or T-containing version of the target. Discrimination scores of greater than 3 were obtained for most targets, demonstrating the ability to identify single base differences between the nanoball targets. The discrimination score is the highest spot score divided by the average of the other 3 base-specific signals of the same spot. Adjusting the assay conditions (buffer composition, concentrations of all components, time and temperature of each step in the cycle) can result in higher signal to background, allowing for calculation of full match to mismatch ratios.

A similar ligation assay was performed on the spotted arrays of 6-mer probes. In this case full-match/background ratio was about 50 and the average full match/mismatch ratio was 30. The results further demonstrated the ability to determine partial or complete sequences of DNA present in DNBs by increasing the number of consecutive probe cycles or by using 4 or more probes labeled with different dyes per each cycle.

To identify the sub-populations, a set of 4 ligation probes specific to each of the 4 bases was used. A 5′ phosphorylated, 3′ TAMRA-labeled pentamer oligonucleotide corresponding to position 33-37 of T1A with sequence CAAAC (probe T1A9b) was paired with one of the following hexamer oligonucleotides corresponding to position 27-32: ACTGTA (probe T1A9a), ACTGTC (probe T1A10a), ACTGTG (probe T1A11a), ACTGTT (probe T1A12a). Each of these 4 ligation probe pairs should hybridize to either an A, C, G or T containing version of T1A. For each hybridization cycle, the probes were incubated with the array in a ligation/hybridization buffer containing T4 DNA ligase at 20° C. for 5 minutes. Excess probes were washed off at 20° C. and images were taken with the TIRF microscope. Bound probes were stripped to prepare for the next round of hybridization.
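The pairing of the shared pentamer with each hexamer can be laid out as a small Python sketch. The probe names and sequences are taken from the paragraph above; the function name ligation_pairs and the dictionary layout are hypothetical. The sketch assumes the sequences are written in the same orientation as T1A, so that the last base of each hexamer (position 32) is the base that the pair reports, which is what the text implies by listing the hexamers in A, C, G, T order.

```python
# Probe sequences as given in the text.
PENTAMER_NAME = "T1A9b"
PENTAMER = "CAAAC"  # positions 33-37 of T1A
HEXAMERS = {        # positions 27-32 of T1A
    "T1A9a": "ACTGTA",
    "T1A10a": "ACTGTC",
    "T1A11a": "ACTGTG",
    "T1A12a": "ACTGTT",
}


def ligation_pairs():
    """Map each interrogated base to its hexamer probe and the joined 11-mer."""
    pairs = {}
    for name, hexamer in HEXAMERS.items():
        base = hexamer[-1]  # assumed to be the base reported at position 32
        pairs[base] = {
            "hexamer_probe": name,
            "pentamer_probe": PENTAMER_NAME,
            # Sequence spanning positions 27-37 once the two probes are ligated.
            "ligated": hexamer + PENTAMER,
        }
    return pairs


pairs = ligation_pairs()
```

Because the four pairs differ only at the hexamer's final base, a spot that gives signal with a single pair identifies the base at that one position.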

An adaptor specific probe (BrPrb3) was hybridized to the array to establish the positions of all the spots (FIG. 8). The 4 ligation probe pairs, at 0.4 μM, were then hybridized successively to the array: the spots hybridized to the A-specific ligation probe pair are shown as red in FIG. 5, the C-specific spots are green, G-specific spots are yellow and the T-specific spots are cyan. In FIG. 5, circle A indicates the position of one of the spots hybridized to both the adaptor probe and the A-specific ligation probe pair, suggesting that the DNA arrayed at this spot is derived from a molecule of T1A that contains an A at position 32. It is clear that most of the spots are associated with only one of the 4 ligation probe pairs, allowing the base at position 32 to be determined specifically.

Using an in-house image analysis program, spots were identified using the images taken for the hybridization cycle using the adaptor probe. The same spots were also identified, and the fluorescent signals were quantified for subsequent cycles, with the base-specific ligation probes. A discrimination score was calculated for each base-specific signal of each spot. The discrimination score is the spot score divided by the average of the other 3 base-specific signals of the same spot. For each spot, the highest of the 4 base-specific discrimination scores was compared with the second highest score. If the ratio of the two was above 1.8, then the base corresponding to the maximum discrimination score was selected for the base calling. In this analysis over 500 spots were successfully base-called and the average discrimination score was 3.34. The average full match signal was 272, while the average single mismatch signal (signals from the un-selected bases) was 83.2. Thus the full match/mismatch ratio was 3.27. The image background noise was calculated by quantifying signals from randomly selected empty spots and the average signal of these empty spots was 82.9. Thus the full match/background noise ratio was 3.28. In these experiments the mismatch discrimination was limited by the low full match signal relative to the background.
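The scoring and base-calling rule described above can be illustrated with a short Python sketch. This is not the in-house image analysis program, which is not disclosed; the function names, the dictionary input, and the example signal values are hypothetical. Only the score definition and the 1.8 ratio threshold come from the text, and all signals are assumed to be positive numbers.

```python
def discrimination_scores(signals):
    """Score each base as its signal divided by the mean of the other three.

    `signals` maps each base ("A", "C", "G", "T") to the quantified signal
    of one spot in the corresponding ligation cycle (assumed positive).
    """
    scores = {}
    for base, value in signals.items():
        others = [v for b, v in signals.items() if b != base]
        scores[base] = value / (sum(others) / len(others))
    return scores


def call_base(signals, min_ratio=1.8):
    """Return the called base, or None when the top two scores are too close."""
    scores = discrimination_scores(signals)
    ranked = sorted(scores.items(), key=lambda item: item[1], reverse=True)
    (best_base, best_score), (_, second_score) = ranked[0], ranked[1]
    if best_score / second_score > min_ratio:
        return best_base
    return None


# Hypothetical spots, for illustration only.
clear_spot = {"A": 272.0, "C": 83.0, "G": 83.0, "T": 83.0}
ambiguous_spot = {"A": 200.0, "C": 180.0, "G": 80.0, "T": 80.0}

clear_call = call_base(clear_spot)          # one base dominates
ambiguous_call = call_base(ambiguous_spot)  # two bases compete, so no call
```

Comparing the best score against the second best, rather than against a fixed cutoff, leaves spots uncalled when two base-specific signals are similarly strong, regardless of their absolute intensity.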

Example 6 Decoding 2 Degenerate Bases at the End of a Synthetic 80-mer Oligonucleotide Using a Probe-Anchor Ligation Assay

A synthetic oligonucleotide containing 8 degenerate bases at the 5′ end was used to simulate random genomic DNA ends. The DNA-nanoballs created from this oligonucleotide will have these 8 degenerate bases placed directly next to the adaptor sequence. To demonstrate the feasibility of sequencing the 2 unknown bases adjacent to the known adaptor sequence using a probe-anchor ligation approach, a 12-mer oligonucleotide with a specific sequence to hybridize to the 3′ end of the adaptor sequence was used as the anchor, and a set of 16 TAMRA-labeled oligonucleotides in the form of BBNNNNNN were used as the sequence-reading probes.
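The count of 16 sequence-reading probes follows from the two defined positions, each of which can be any of 4 bases. A minimal Python sketch that enumerates the set is shown here; the function name reading_probes is hypothetical, and the probes are written in the BBNNNNNN form used in the text, with N denoting a degenerate position.

```python
from itertools import product

BASES = "ACGT"


def reading_probes(defined=2, degenerate=6):
    """Enumerate probes with `defined` specified bases followed by N positions."""
    tail = "N" * degenerate
    return ["".join(prefix) + tail for prefix in product(BASES, repeat=defined)]


probes = reading_probes()

# The subset with G at the first defined position (GA, GC, GG and GT in place of BB).
g_subset = [p for p in probes if p.startswith("G")]
```

With the defaults this yields 4 × 4 = 16 probes, each 8 characters long.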

Using a subset of the BBNNNNNN probe set (namely GA, GC, GG and GT in the place of BB), spots could be identified on the nano-ball array created from targets that specifically bind to one of these 4 probes, with an average full match/mismatch ratio of over 20 (FIG. 9).

Example 7 Producing Structured Nano-Ball Arrays

Ordered array lines of capture probe separated on average by 5 μm were prepared. Lines were produced by using a pulled glass capillary beveled at 45 degrees to a tip size of 5 μm, loaded with 1 μl of 5 μM capture probe in water, and drawn across the glass slide by a precision gantry robot. DNBs were allowed to attach to the surface of the coverslip and then detected with a probe specific for the adaptor. FIG. 10 shows the high density attachment to regions where a capture probe was deposited on the surface, indicating that DNBs can be arranged in a grid if a substrate with submicron binding sites is prepared.

Example 8 Demonstrating Circle Formation with Multiple Adaptors

A synthetic target DNA of 70 bases in length and a PCR derived fragment of 200-300 bp in length was obtained from a double stranded product by phosphorylation of one of the primers and treatment with lambda exonuclease to remove the phosphorylated strand. The single stranded fragment was ligated to an adaptor for circularization. Polymerization, type IIs restriction enzyme digestion and re-ligation with a new adaptor were performed as described herein.

Demonstration that the process was successful was accomplished using RCR amplification of the final derived circles. Briefly, the DNA circles were incubated with primer complementary to the last introduced adaptor and Phi29 polymerase for 1 hour at 30° C. to generate a single concatemer molecule consisting of hundreds of repeated copies of the original DNA circle. Attachment of the RCR products to the surface of coverslips could also be accomplished by utilizing an adaptor sequence in the concatemer that is complementary to an attached oligonucleotide on the surface. Hybridization of adaptor unique probes was used to demonstrate that the individual adaptors were incorporated into the circle and ultimately the RCR product. To demonstrate that the adaptors were incorporated at the expected positions within the circle, sequence specific probes (labeled 5-mers) were used for the synthetic or PCR derived sequence such that ligation may occur to an unlabeled anchor probe that recognizes the terminal sequence of the adaptor. Cloning and sequencing were also used to verify DNA integrity. The process was simplified by generating clean ssDNA after each circle cutting which allowed the use of the same circle closing chemistry for each of the adaptor incorporations.

1-42. (canceled)
43. A method of inserting multiple adaptors in a target nucleic acid comprising: a) ligating a first adaptor to one terminus of said target nucleic acid to form a first linear construct, wherein said adaptor comprises a recognition site for a first restriction enzyme; b) circularizing first linear construct to create a first circular polynucleotide; c) cleaving said first circular polynucleotide with said first restriction enzyme to form a second linear construct, wherein said first restriction enzyme is able to bind to said recognition site within said first adaptor; d) ligating a second adaptor to said second linear construct to form a third linear construct, wherein said second adaptor comprises a recognition site for a second restriction enzyme; e) circularizing said third linear construct to create a second circular polynucleotide.
44-82. (canceled)